Getting a Web Page's Meta Information with PHP

dly · Posted on 2012-10-13 22:26:03

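A page's meta information is not displayed on the page itself, but it matters to search engines, especially a few important tags such as keywords and description. Anyone who has worked with SEO will know this well.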

PHP has a built-in function, get_meta_tags, which reads a page's meta tags (those that carry a name attribute) directly. It is simple to use, as shown below:

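<?php
// Fetch the meta information of a web page
$url = "http://www.benxiaohai.com";
$meta = get_meta_tags($url);
// Mind the page encoding here: if it differs from the encoding your script uses, convert the values with iconv
print_r($meta);
// The result is a plain one-dimensional array keyed by tag name; if the page has no meta information, the array is empty (and the function returns false if the page cannot be read)
?>

This function is convenient, but sometimes we want more information than it returns, and then it has to be combined with some extra code. The code below is meant for a simple search engine. Because a search engine may only need the keywords and description from the meta information, everything else is filtered out.
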
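<?php
function get_meta_data($html) {
    // Start from an empty array so a page without matching tags still returns an array
    $meta = array();
    // Capture name and content of every <meta> tag (name must come before content)
    preg_match_all(
        "|<meta[^>]+name=\"([^\"]*)\"[^>]+content=\"([^\"]*)\"[^>]*>|i", $html, $out, PREG_PATTERN_ORDER);
    for ($i = 0; $i < count($out[1]); $i++) {
        // Walk through the matches and keep only the tags we want
        if (strtolower($out[1][$i]) == "keywords") $meta['keywords'] = $out[2][$i];
        if (strtolower($out[1][$i]) == "description") $meta['description'] = $out[2][$i];
    }
    return $meta;
}
?>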
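
The pattern above only matches when the name attribute comes before content. As a sketch of one alternative (not the original author's method), the same two tags can be collected with PHP's DOM extension, which does not depend on attribute order. The helper name get_meta_data_dom and the sample markup are invented for illustration, and the sketch assumes the DOM extension is enabled. The sample is ASCII-only; pages in other encodings may need a charset declaration or a conversion first.

```php
<?php
// Hypothetical helper (not from the original post): collect keywords and
// description with the DOM extension instead of a regular expression.
function get_meta_data_dom($html)
{
    $meta = array();
    if ($html === '') {
        return $meta; // nothing to parse
    }
    // Collect parser warnings internally; real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $tag) {
        $name = strtolower($tag->getAttribute('name'));
        if ($name === 'keywords' || $name === 'description') {
            $meta[$name] = $tag->getAttribute('content');
        }
    }
    return $meta;
}

// Sample markup, invented for illustration: the first tag puts content before name.
$html = '<html><head>'
      . '<meta content="php, meta, seo" name="Keywords">'
      . '<meta name="description" content="A sample page">'
      . '</head><body></body></html>';

$meta = get_meta_data_dom($html);
print_r($meta);
```

Because the tag name is lowercased before the comparison, `Keywords` and `keywords` are treated alike, as in the regex version.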


Next is an application built on the same idea:

<?php
function getUrlData($url)
{
    $result = false;
    $contents = getUrlContents($url);
    if (isset($contents) && is_string($contents))
    {
        $title = null;
        $metaTags = null;
        // Extract the page title
        preg_match('/<title>([^>]*)<\/title>/si', $contents, $match );
        if (isset($match) && is_array($match) && count($match) > 0)
        {
            $title = strip_tags($match[1]);
        }
        // Extract every <meta name="..." content="..."> tag
        preg_match_all('/<[\s]*meta[\s]*name="?' . '([^>"]*)"?[\s]*' . 'content="?([^>"]*)"?[\s]*[\/]?[\s]*>/si', $contents, $match);
        if (isset($match) && is_array($match) && count($match) == 3)
        {
            $originals = $match[0];
            $names = $match[1];
            $values = $match[2];
            if (count($originals) == count($names) && count($names) == count($values))
            {
                $metaTags = array();
                for ($i=0, $limiti=count($names); $i < $limiti; $i++)
                {
                    $metaTags[$names[$i]] = array (
                        // Assumes a UTF-8 page; without an explicit charset, PHP before 5.4
                        // treats the text as ISO-8859-1 and garbles multibyte characters
                        'html' => htmlentities($originals[$i], ENT_QUOTES, 'UTF-8'),
                        'value' => $values[$i]
                    );
                }
            }
        }
        $result = array (
            'title' => $title,
            'metaTags' => $metaTags
        );
    }
    return $result;
}
function getUrlContents($url, $maximumRedirections = null, $currentRedirection = 0)
{
    $result = false;
    $contents = @file_get_contents($url);
    // Check if we need to go somewhere else
    if (isset($contents) && is_string($contents))
    {
        // Look for a <meta http-equiv="refresh"> redirect
        preg_match_all('/<[\s]*meta[\s]*http-equiv="?REFRESH"?' . '[\s]*content="?[0-9]*;[\s]*URL[\s]*=[\s]*([^>"]*)"?' . '[\s]*[\/]?[\s]*>/si', $contents, $match);
        if (isset($match) && is_array($match) && count($match) == 2 && count($match[1]) == 1)
        {
            // With no limit set ($maximumRedirections = null), redirects are followed indefinitely
            if (!isset($maximumRedirections) || $currentRedirection < $maximumRedirections)
            {
                return getUrlContents($match[1][0], $maximumRedirections, ++$currentRedirection);
            }
            // Redirect limit reached: give up
            $result = false;
        }
        else
        {
            $result = $contents;
        }
    }
    // Return $result, not $contents, so that an exceeded redirect limit yields false
    return $result;
}
$result = getUrlData('http://www.benxiaohai.com');
echo '<pre>'; print_r($result); echo '</pre>';
/**************************************** Result *******************************
Array
(
    [title] => 笨小孩心情驿站 - 老天爱笨小孩
    [metaTags] => Array
        (
            [author] => Array
                (
                    [html] => <meta name="author" content="笨小孩心情驿站" />
                    [value] => 笨小孩心情驿站
                )
            [description] => Array
                (
                    [html] => <meta name="description" content="老天爱笨小孩" />
                    [value] => 老天爱笨小孩
                )
            [keywords] => Array
                (
                    [html] => <meta name="keywords" content="笨小孩心情驿站" />
                    [value] => 笨小孩心情驿站
                )
        )
)
****************************************************************************/
?>
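
The first example notes that meta values may need converting with iconv when the page encoding differs from the one the script works in. A minimal sketch of that step follows, assuming the page is GBK and the script works in UTF-8. The helper name convert_meta_encoding is hypothetical, and the GBK input is simulated from a UTF-8 literal rather than fetched from a real page, so the source file itself must be saved as UTF-8.

```php
<?php
// Hypothetical helper: convert every value of a get_meta_tags()-style array
// from the page's encoding to the encoding the script works in.
function convert_meta_encoding(array $meta, $from, $to = 'UTF-8')
{
    foreach ($meta as $name => $value) {
        // iconv() returns false (and raises a notice) when conversion fails.
        $converted = @iconv($from, $to, $value);
        // On failure, keep the original value instead of losing it.
        $meta[$name] = ($converted === false) ? $value : $converted;
    }
    return $meta;
}

// Simulate tags read from a GBK page by encoding a UTF-8 literal as GBK.
$gbkMeta = array('description' => iconv('UTF-8', 'GBK', '老天爱笨小孩'));

$utf8Meta = convert_meta_encoding($gbkMeta, 'GBK');
echo $utf8Meta['description'], "\n";
```

In practice the source encoding would have to be detected first, for example from the page's charset declaration; the sketch simply takes it as a parameter.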


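The next piece of code checks which common meta tags a page provides and prints a suggestion for each one that is missing. With small changes, such as also checking the length of each tag, it can be made more thorough.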

<?php
$url = "http://www.benxiaohai.com";
$result = get_meta_tags($url);
// Collect the tags we care about; start empty so the checks below are always safe
$rs = array();
if (is_array($result)) {
    foreach ($result as $key => $value) {
        $key = strtolower($key);
        switch ($key) {
            case "keywords":
                $rs["keywords"] = $value;
                break;
            case "description":
                $rs["description"] = $value;
                break;
            case "author":
                $rs["author"] = $value;
                break;
            case "robots":
                $rs["robots"] = $value;
                break;
            case "copyright":
                $rs["copyright"] = $value;
                break;
        }
    }
} else {
    // get_meta_tags() did not return an array
    echo "No meta tags found";
}
if (empty($rs["robots"])) {
    echo "No robots tag";
}
if (empty($rs["copyright"])) {
    echo "<br>The page has no copyright information";
}
if (empty($rs["author"])) {
    echo "<br>No author information";
}
if (empty($rs["description"])) {
    echo "<br>No page description";
}
if (empty($rs["keywords"])) {
    echo "<br>No keywords";
}
?>
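
The script above only reports tags that are missing. A length check could be added along the following lines. This is a sketch under stated assumptions: the helper name check_meta_lengths is hypothetical, the limits are illustrative values rather than figures from the post, and the values are assumed to be UTF-8.

```php
<?php
// Hypothetical helper: compare the length of selected meta tags with
// assumed upper limits and return one message per tag.
function check_meta_lengths(array $meta, array $limits)
{
    $messages = array();
    foreach ($limits as $name => $max) {
        if (empty($meta[$name])) {
            $messages[$name] = "missing";
            continue;
        }
        // Count characters rather than bytes so multibyte text is measured fairly.
        $length = iconv_strlen($meta[$name], 'UTF-8');
        $messages[$name] = ($length > $max)
            ? "too long ($length characters, limit $max)"
            : "ok ($length characters)";
    }
    return $messages;
}

// Illustrative limits only; they are assumptions, not values from the post.
$limits = array('description' => 160, 'keywords' => 100);

// Sample data shaped like the array get_meta_tags() returns.
$meta = array('description' => str_repeat('a', 200), 'keywords' => 'php, meta');

$report = check_meta_lengths($meta, $limits);
print_r($report);
```

With a real page, `$meta` would be the array returned by get_meta_tags, and the messages could be echoed in the same way as the checks above.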

