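Chinese Word Segmentation and Multi-Keyword Search for SupeSite
[Author: 叶歆昊 · Last modified: 2010-07-19 · When reposting, please credit the original link: http://littz.com/supesite-chinese-word-multiple-keywords-search-method.html]
Because of the limits of the servers most webmasters run on, SupeSite has had no practical Chinese word-segmentation search (a site that grows large and efficient enough to do this well is already close to being a search engine). Recently I learned from Zhang Yan's (张宴) blog about HTTPCWS, the open-source, HTTP-based Chinese word-segmentation system he developed, and it occurred to me to use it for segmented search in SupeSite.
Here is how it works:
1. I set up HTTPCWS on my own server. Segmentation demo (HTTP GET): http://littz.com:1989/?w=尤其是,对于那些已经安装使用 Discuz! 和 UCenter Home 的站长来说,通过 SupeSite 7.0,马上就可以快速搭建一个社区门户,拥有一套简洁高效易用的社区资讯发布系统了。 The returned result is split into: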
尤其是 , 对于 那些 已经安装 使用 Discuz ! 和 UCenter Home 的 站长 来说 , 通过 SupeSite 7.0 , 马上 就可以 快速 搭建 一个 社区 门户 , 拥有 一套 简洁 高效 易用 的 社区 资讯 发布系统 了 。
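Note that when a whole sentence is passed in, it is split fairly accurately into individual words.
2. SupeSite's batch.search.php receives the $searchkey variable that the visitor submits when searching.
3. $searchkey is sent to the littz.com:1989 server, which returns the segmented words.
4. All Chinese and English punctuation is replaced with spaces, and runs of consecutive spaces are collapsed into one.
5. The query used to match the data becomes:
SELECT * FROM `supe_spaceitems` WHERE `subject` LIKE '%word1%word2%word3%' ORDER BY `dateline` DESC LIMIT 0,20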
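Steps 4 and 5 can be sketched as a single helper. This is only an illustration under stated assumptions: build_like_pattern is a hypothetical name, the input is assumed to be UTF-8, and punctuation is matched with Unicode character classes rather than with an explicit character list.

```php
<?php
// Hypothetical helper (not part of SupeSite): turns the segmenter's
// space-separated output into a LIKE pattern, as in steps 4 and 5.
function build_like_pattern($segmented)
{
    // Step 4: replace every punctuation or symbol character with a space.
    // Assumes $segmented is UTF-8; \p{P} and \p{S} need the /u modifier.
    $clean = preg_replace('/[\p{P}\p{S}]+/u', ' ', $segmented);
    // Collapse runs of whitespace and drop leading/trailing spaces.
    $clean = trim(preg_replace('/\s+/u', ' ', $clean));
    // Step 5: join the words with % and wrap the result in %...%.
    return '%' . str_replace(' ', '%', $clean) . '%';
}

$pattern = build_like_pattern('SupeSite 7.0 , 社区 门户 。');
// $pattern would then be used as the LIKE operand on `subject`.
echo $pattern, "\n";
```

Note that the period in "7.0" is treated as punctuation, so the version number is split into two separate words.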
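Demo: http://littz.com, running SupeSite 7.0. One article there is titled "WordPress平滑迁移至SupeSite". Without segmentation, a visitor who searches for "wordpress迁移至supesite" gets no results, because the title has "平滑" between "WordPress" and "迁移" and so does not contain the query as one continuous string. With segmented search, the input "wordpress迁移至supesite" is split into the words "wordpress", "迁移", "至" and "supesite". The corresponding query is: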
SELECT * FROM `supe_spaceitems` WHERE `subject` LIKE '%wordpress%迁移%至%supesite%' ORDER BY `dateline` DESC LIMIT 0,20
Each % matches any run of characters, including "平滑", so the article "WordPress平滑迁移至SupeSite" is found.
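To see why the segmented pattern matches where the plain one does not, the matching can be imitated in PHP. This is a rough sketch: like_matches is a hypothetical helper that emulates LIKE with only the % wildcard, and it assumes a case-insensitive collation, as the demo above implies.

```php
<?php
// Hypothetical helper for illustration only: emulates a case-insensitive
// MySQL LIKE match where % stands for any run of characters.
// (The _ wildcard and escape sequences are deliberately not handled.)
function like_matches($pattern, $subject)
{
    $parts = array_map(
        function ($p) { return preg_quote($p, '/'); },
        explode('%', $pattern)
    );
    $regex = '/^' . implode('.*', $parts) . '$/isu';
    return preg_match($regex, $subject) === 1;
}

$title = 'WordPress平滑迁移至SupeSite';
var_dump(like_matches('%wordpress迁移至supesite%', $title));    // unsegmented query
var_dump(like_matches('%wordpress%迁移%至%supesite%', $title)); // segmented query
var_dump(like_matches('%supesite%wordpress%', $title));         // same words, other order
```

The last call illustrates a limit of this approach: the words must appear in the title in the same order as in the query.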
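I. Changes for the GBK version of SupeSite:
Insert the code below into batch.search.php, at about line 132, on the line after
$urlplus = 'searchkey='.rawurlencode($searchkey).'&type='.rawurlencode($type);
(do not change the order, or the segmentation result will not be accurate).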
// Replace Chinese and English punctuation with spaces.
function clear_point($jiugui)
{
    $punctuation = array(
        // ASCII punctuation
        '~', '!', '@', '#', '$', '%', '^', '&', '*', ',', '.', '?', ';', ':',
        '/', '\\', '\'', '"', '[', ']', '{', '}', '<', '>',
        // Chinese (full-width) punctuation
        '!', '¥', '……', '…', '、', ',', '。', '?', ';', ':',
        '‘', '’', '“', '”', '【', '】',
    );
    // A single replacement string is applied to every entry in the search array.
    return str_replace($punctuation, ' ', $jiugui);
}

// Send the query to HTTPCWS and read back the space-separated words.
$searchkey = urlencode($searchkey);
$searchkey = file_get_contents("http://littz.com:1989/?w=".$searchkey);
// Strip punctuation, then collapse runs of whitespace into one space.
$searchkey = clear_point($searchkey);
$searchkey1 = preg_replace('/\s+/',' ',$searchkey);
// Join the words with % so they can be used inside the LIKE pattern.
$searchkey = str_replace(' ','%',$searchkey1);
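II. Changes for the UTF-8 version of SupeSite:
Insert the code below into batch.search.php, at about line 132, on the line after
$urlplus = 'searchkey='.rawurlencode($searchkey).'&type='.rawurlencode($type);
(do not change the order, or the segmentation result will not be accurate).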
// Replace Chinese and English punctuation with spaces.
function clear_point($jiugui)
{
    $punctuation = array(
        // ASCII punctuation
        '~', '!', '@', '#', '$', '%', '^', '&', '*', ',', '.', '?', ';', ':',
        '/', '\\', '\'', '"', '[', ']', '{', '}', '<', '>',
        // Chinese (full-width) punctuation
        '!', '¥', '……', '…', '、', ',', '。', '?', ';', ':',
        '‘', '’', '“', '”', '【', '】',
    );
    // A single replacement string is applied to every entry in the search array.
    return str_replace($punctuation, ' ', $jiugui);
}

// HTTPCWS expects GBK, so convert the UTF-8 query before sending it.
$searchkey = iconv("UTF-8", "GBK//IGNORE", $searchkey);
$searchkey = urlencode($searchkey);
$searchkey = file_get_contents("http://littz.com:1989/?w=".$searchkey);
// Convert the GBK response back to UTF-8 for the rest of the site.
$searchkey = iconv("GBK", "UTF-8//IGNORE", $searchkey);
// Strip punctuation, then collapse runs of whitespace into one space.
$searchkey = clear_point($searchkey);
$searchkey1 = preg_replace('/\s+/',' ',$searchkey);
// Join the words with % so they can be used inside the LIKE pattern.
$searchkey = str_replace(' ','%',$searchkey1);
HTTPCWS only accepts GBK input, so UTF-8 text has to be converted to GBK for segmentation and then converted back to UTF-8.
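The round trip can be isolated in a small wrapper, which makes it easy to check without contacting any server. This is a sketch only: segment_utf8 and the stand-in segmenter are hypothetical names, and it assumes the iconv extension supports GBK and the //IGNORE suffix on your platform.

```php
<?php
// Hypothetical wrapper: $segment is any callable that accepts and returns
// GBK bytes (for example, a function that calls an HTTPCWS server).
function segment_utf8($text, $segment)
{
    // UTF-8 -> GBK; //IGNORE drops characters GBK cannot represent.
    $gbk = iconv('UTF-8', 'GBK//IGNORE', $text);
    $result = call_user_func($segment, $gbk);
    // GBK -> UTF-8 so the rest of the UTF-8 site can use the result.
    return iconv('GBK', 'UTF-8//IGNORE', $result);
}

// Stand-in segmenter used only for this demonstration: it records the
// bytes it received and returns them unchanged.
$received = null;
$fake = function ($gbk) use (&$received) {
    $received = $gbk;
    return $gbk;
};

$out = segment_utf8('迁移', $fake);
printf("UTF-8 bytes: %d, GBK bytes sent: %d\n", strlen('迁移'), strlen($received));
```

Each of these two characters takes three bytes in UTF-8 and two in GBK, which is why passing UTF-8 bytes straight to a GBK-only service would produce garbage.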
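III. Changes required for both the GBK and UTF-8 versions.
In the default template, templates/default/site_search.html.php, near line 56, change
<input type="text" class="input_tx" size="50" name="searchkey" value="$searchkey" />
to
<input type="text" class="input_tx" size="50" name="searchkey" value="$searchkey1" />
After the code above runs, $searchkey holds the words joined with %, while $searchkey1 holds the same words separated by spaces, which is what the search box should display.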
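Additional notes:
Segmented search depends on the littz.com server and on the network connection between the server hosting your SupeSite (SS) site and littz.com. The littz.com server is in Silicon Valley, USA (IP: 64.71.167.26), so traffic to it is affected by the undersea cables; the cable failure of 17 August 2009, for example, made access slow. The HTTPCWS interface itself segments Chinese very quickly. If you have the means, I recommend setting up your own HTTPCWS + Sphinx search server. I cannot guarantee that this service will keep running long term, though I will certainly try to keep it up; it is offered as one way of solving the problem.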
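Since the search depends on a remote service, it may be worth guarding the call with a timeout and a fallback. The following is a minimal sketch, not part of the original modification: fetch_segments, segment_or_fallback, the 2-second timeout and the endpoint placeholder are all assumptions made for illustration.

```php
<?php
// Hypothetical fetcher: calls a segmentation endpoint with a short timeout.
// $endpoint is a placeholder such as "http://your-server:1989/?w=".
function fetch_segments($endpoint, $encodedQuery, $timeoutSeconds = 2)
{
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeoutSeconds),
    ));
    // @ suppresses the warning on failure; the return value is then false.
    return @file_get_contents($endpoint . $encodedQuery, false, $context);
}

// Falls back to the visitor's original keyword when segmentation fails,
// so the search degrades to a plain LIKE '%keyword%' instead of breaking.
function segment_or_fallback($keyword, $fetch)
{
    $result = call_user_func($fetch, urlencode($keyword));
    if ($result === false || trim($result) === '') {
        return $keyword;
    }
    return $result;
}

// Demonstration with stand-in fetchers (no network access needed).
$down = function ($q) { return false; };
$up = function ($q) { return 'wordpress 迁移 至 supesite'; };
echo segment_or_fallback('wordpress迁移至supesite', $down), "\n";
echo segment_or_fallback('wordpress迁移至supesite', $up), "\n";
```

In real use, the second argument would be a closure that calls fetch_segments with your own endpoint; the stand-ins above only show the two outcomes.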