Clustering Search Engine Based on Independent Users.rar
Original document published by member usactu
Clustering Search Engine Based on Independent Users
About 23,000 characters, 52 pages
Includes the thesis proposal, task statement, and internship report
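Abstract (translated from the Chinese)
The rapid growth of the Internet has made an ever larger volume of information available online, and search engines have become an indispensable tool for retrieving what is needed quickly. Although existing search engines use a variety of methods to improve retrieval precision, they still cannot exclude documents that are irrelevant to the user's query. Relevant and irrelevant documents remain mixed together, which places an extra burden on the user.
After introducing search engines in general and the cluster analysis process, this thesis designs and implements a clustering-based search engine for independent users. It helps web users pick out the documents they need from the large number of snippets a search engine returns. The returned results are clustered into several clusters so that document relatedness is as high as possible within a cluster and as low as possible between clusters, which greatly reduces the number of results a user has to browse and shortens query time. In the design and implementation, for each independent search request the system uses the API provided by Yahoo to obtain the source data, builds an index with the inverted file indexing model, and represents the index with key phrases. It computes the similarity between returned pages from the information in their titles, URLs, and page summaries, maps the results and their similarity relationships to an undirected graph, and finally clusters the points of that graph by similarity to obtain the final result. For the clustering step, the thesis proposes a new clustering method. The method first randomly selects a few points as initial centroids, then computes in turn the similarity between each remaining point and the centroids and decides whether to add the point to a cluster. If the similarity exceeds a threshold, the point is added to the cluster represented by that centroid and the centroid's position is adjusted, until all points have been added. Theoretical analysis shows that the inverted file model used by the system needs few resources and that the proposed clustering algorithm resolves the document ambiguity (polysemy) problem to some extent. The system also handles isolated points. Experimental results likewise show that the proposed clustering method works well and clusters the returned results effectively.
Keywords: search engine, clustering, index, similarity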
Clustering Search Engine Based on Independent Users
Abstract
The rapid development of the Internet has made more and more information available online. To find the required information quickly, search engines have become one of the indispensable Internet tools. Although many search engines apply various methods to improve retrieval precision, the retrieved results still include many irrelevant documents mixed with the relevant ones, which places an additional burden on users.
Building on an overview of search engines and cluster analysis, this thesis designs and implements a clustering search engine for independent users. It helps web users pick out the required information from a long list of returned snippets: the retrieved results are automatically assigned to groups based on their computed similarity. The groups (clusters) formed should have a high degree of association between members of the same group and a low degree between members of different groups, so users can view only the groups they are interested in and save considerable time. In the design and implementation, each time an independent search request is sent, the system uses the Yahoo API to obtain the source data, builds an index with the inverted file indexing model, and uses key phrases to represent the index. Similarity is calculated from the title, URL, and summary of the returned results, and the search results and their similarity relationships are then mapped to an undirected graph. Finally, the points of the undirected graph are clustered according to their similarity to produce the final results.
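The indexing and similarity steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the result records are hypothetical dictionaries with `title`, `url`, and `summary` fields, single words stand in for the key phrases the thesis uses, and Jaccard overlap of term sets is an assumed similarity measure, since the abstract does not name one.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplification)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_inverted_index(results):
    """Map each term to the set of result ids whose title, URL, or summary contains it."""
    index = defaultdict(set)
    for doc_id, result in enumerate(results):
        for field in ("title", "url", "summary"):
            for term in tokenize(result[field]):
                index[term].add(doc_id)
    return index


def similarity_graph(index, n_docs):
    """Build a weighted undirected graph as {(a, b): weight} with a < b.

    The weight is the Jaccard similarity of the two results' term sets
    (an assumed measure); pairs with no shared term get no edge.
    """
    # Recover each document's term set from the inverted index.
    terms_of = defaultdict(set)
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            terms_of[doc_id].add(term)

    graph = {}
    for a, b in combinations(range(n_docs), 2):
        shared = terms_of[a] & terms_of[b]
        if shared:
            graph[(a, b)] = len(shared) / len(terms_of[a] | terms_of[b])
    return graph
```

Because the edges are stored under ordered pairs `(a, b)` with `a < b`, each undirected edge appears once. A real system would likely drop stop words and URL scheme tokens such as `http` before indexing, since they inflate every similarity.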
For the clustering step, a new clustering method is proposed. First, the method randomly selects a few points as initial centroids. Then each remaining point is added in turn to one or several clusters based on its similarity to each centroid: if the similarity is greater than a predefined threshold, the point is added to that cluster and the cluster's centroid is adjusted, until all points have been added. Theoretical analysis shows that the inverted file model used in the system needs few resources and that the proposed clustering algorithm resolves, to some extent, the problem of document ambiguity (polysemy). The isolated point problem is also handled. Experimental results show that the proposed method clusters well and that the returned results can be clustered effectively.
Key Words: Search Engine; Clustering; Index; Similarity
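The threshold-based clustering procedure outlined in the abstract might look something like the sketch below. Several details are assumptions, because the abstract leaves them open: documents are represented as hypothetical term-weight dictionaries, cosine similarity is used as the similarity measure, the centroid is adjusted by taking the mean of the cluster's vectors, and points that clear the threshold for no centroid are simply collected in an `isolated` list. A point may join more than one cluster, which is one way to accommodate ambiguous documents.

```python
import math
import random
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Component-wise mean of sparse vectors (assumed centroid update rule)."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # adds the weights term by term
    return {t: w / len(vectors) for t, w in total.items()}


def threshold_cluster(vectors, k, threshold, seed=0):
    """Cluster points around k randomly chosen initial centroids.

    Returns (clusters, isolated): clusters is a list of k lists of point
    indices; isolated lists points that matched no centroid.
    """
    rng = random.Random(seed)
    seeds = rng.sample(range(len(vectors)), k)
    clusters = [[s] for s in seeds]
    centroids = [dict(vectors[s]) for s in seeds]
    isolated = []

    for i, vec in enumerate(vectors):
        if i in seeds:
            continue
        joined = False
        for c in range(k):
            if cosine(vec, centroids[c]) > threshold:
                clusters[c].append(i)
                # Adjust the centroid after each addition.
                centroids[c] = mean_vector([vectors[j] for j in clusters[c]])
                joined = True
        if not joined:
            isolated.append(i)
    return clusters, isolated
```

The result depends on the random seeds and on the order in which points are visited, so fixing `seed` keeps runs reproducible. How the thesis itself treats isolated points is not stated in the abstract; collecting them separately is only one plausible choice.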
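List of Figures
Figure 2-1 Basic components of a search engine
Figure 2-2 Structure of a meta search engine
Figure 3-1 Hierarchical agglomerative clustering
Figure 4-1 Module structure of the clustering search engine system
Figure 4-2 Undirected graph weighted by similarity
Figure 5-1 User login page
Figure 5-2 Results page for a user's search keywords
Figure 5-3 Information shown after clicking a cluster entry
List of Tables
Table 4-1 Result of inverting articles 1 and 2
Table 4-2 Inverted result for articles 1 and 2 after enhanced processing
Table 5-1 Comparison of single-word and key-phrase feature terms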
Contents
Abstract (Chinese)
Abstract (English)
List of Figures
List of Tables
1. Introduction
1.1 Research background
1.2 Research overview
1.3 Structure of this thesis
2. Overview of Search Engines
2.1 Components of a search engine
2.1.1 Robot
2.1.2 Analyzer
2.1.3 Indexer
2.1.4 Retriever
2.1.5 User interface
2.2 Search engine workflow
2.3 Types of search engines
2.3.1 Full-text search engines
2.3.2 Directory-based search engines
2.3.3 Vertical search engines
2.3.4 Meta search engines
3. Clustering Research
3.1 Automatic document classification
3.2 Cluster analysis
3.3 Basic clustering methods
3.3.1 Flat partitioning methods
3.3.2 Hierarchical agglomerative methods
3.4 Web page clustering algorithms
3.4.1 Clustering algorithms based on page content
3.4.2 Clustering algorithms based on link analysis
3.4.3 Clustering algorithms based on user search logs
4. Design of the Clustering Search Engine
4.1 Data source preprocessing
4.2 Index construction
4.3 Similarity calculation
4.4 Clustering
5. Performance Analysis
5.1 Theoretical analysis
5.2 System demonstration
Conclusion
Acknowledgements
References