Design and Implementation of a Lucene-Based Site Search Engine (with proposal, task statement, and PPT).rar
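Description
Original document posted by member usactu
Design and Implementation of a Lucene-Based Site Search Engine
About 16,000 characters, 41 pages
Includes the thesis proposal, task statement, defense slides (PPT), thesis body, and main program code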
Abstract
Lucene [1] is a subproject of the Jakarta project at the Apache Software Foundation. It is an open source full-text search toolkit: not a complete full-text search engine, but a framework for one, providing a complete query engine and indexing engine. Lucene aims to give software developers a simple, easy-to-use toolkit with which they can add full-text search to a target system or build a complete full-text search engine on top of it.
Since its release, Lucene has drawn a strong response from the open source community. Programmers use it not only to build dedicated full-text search applications but also to integrate search into system software and web applications, and some commercial software uses Lucene as the core of its internal full-text search subsystem. The Apache Software Foundation's website uses Lucene as its full-text search engine, version 2.1 of IBM's open source software Eclipse uses Lucene as the full-text indexing engine of its help subsystem, and IBM's corresponding commercial product WebSphere uses Lucene as well. Its open source nature, well-designed index structure, and sound system architecture have earned Lucene increasingly wide adoption.
The requirements for this system come from the official website of the 2007 Special Olympics World Games, which I developed during an internship at a company. The website includes a site search engine that I implemented with Lucene on the .NET platform. The website now runs stably, and site search makes it more capable and gives users a more convenient way to find content.
This thesis studies and analyzes the principles, components, data structures, and workflow of search engines, and then uses Lucene to design and implement a full-text site search engine. Finally, it explains how to improve Lucene's efficiency from two angles: incremental indexing and index optimization.
Keywords: full-text retrieval, search engine, Lucene, Jakarta
Contents
1. Introduction
1.1 Background
1.2 Current State of Research and Open Problems
1.3 Thesis Organization
2. Full-Text Retrieval and Lucene
2.1 Introduction to Full-Text Retrieval
2.2 Full-Text Retrieval Systems Compared with Databases
2.3 Introduction to Lucene
2.4 Applications, Features, and Advantages of Lucene
2.5 A Study of Internet Search Engines
2.6 A Brief Introduction to Chinese Word Segmentation
3. Lucene System Architecture
3.1 Organization of the Lucene System
3.2 Data Flow Analysis
3.3 Analysis of the Lucene Index File Format
3.3.1 Notes on the Analysis of Lucene's Source Implementation
3.3.2 Lucene Index File Format
3.4 Lucene's Inverted Index Principle
3.5 Ranking of Lucene Search Results
4. System Design and Implementation
4.1 System Requirements
4.2 Development Environment and Tools
4.3 System Organization
4.4 Process Implementation
4.4.1 Building Dynamic Indexes for the Different Modules of the Website
4.4.2 Search Page
4.4.3 Search Results Page
5. Key Techniques
5.1 Incremental Indexing in Lucene
5.2 Index Optimization
5.3 Generality of the Lucene File Format
5.4 Handling of Private Files
Conclusion
Acknowledgements
References
Appendix A: Main Source Code
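References
[8] Peng Honghui, Lin Zuoquan. Search engines and meta-search engines on the Internet [J]. Computer Science.
[9] Cao Yuanda et al. Design and implementation of a full-text retrieval system for Chinese web documents [J]. Transactions of Beijing Institute of Technology.
[10] Yan Weilong et al. Organization of index files in network-oriented full-text retrieval [J]. Application Research of Computers.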
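Appendix A: Main Source Code
1. Generating the static index
// IntranetIndexer is the thesis's own indexing class; its constructor argument is presumably the index location.
IntranetIndexer writer = new IntranetIndexer(@"E:UsingWebGenSIndexForEnIndex");
// Add the files under the given directory that match each file pattern.
writer.AddDirectory(new DirectoryInfo(@"E:BackUpOutWebEnglish"), "*.aspx");
writer.AddDirectory(new DirectoryInfo(@"E:BackUpOutWebEnglish"), "*.html");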
......
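The excerpt above relies on the author's IntranetIndexer class, whose internals are not shown here. As a rough illustration of the kind of structure such an indexer maintains, the following is a minimal sketch of an in-memory inverted index, written in Java (the language Lucene itself is implemented in) and using only the standard library. The class name InvertedIndexSketch and its methods addDocument and search are hypothetical names chosen for this example; they are not part of Lucene or of the thesis code, and the tokenization is a deliberately simple assumption that does not handle Chinese word segmentation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical in-memory inverted index; a teaching sketch, not Lucene's implementation. */
final class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // document id -> document name (the id is the position in this list)
    private final List<String> names = new ArrayList<>();

    /** Splits text into lowercase terms on any run of non-letter, non-digit characters. */
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    /** Adds one document and returns its id; only this document's terms are touched. */
    int addDocument(String name, String text) {
        int id = names.size();
        names.add(name);
        for (String term : tokenize(text)) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    /** Returns the names of documents that contain every term of the query (AND semantics). */
    List<String> search(String query) {
        TreeSet<Integer> hits = null;
        for (String term : tokenize(query)) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (hits == null) {
                hits = new TreeSet<>(docs);   // first term: start from its posting list
            } else {
                hits.retainAll(docs);         // later terms: intersect posting lists
            }
        }
        List<String> result = new ArrayList<>();
        if (hits != null) {
            for (int id : hits) {
                result.add(names.get(id));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("news.html", "Special Olympics opening ceremony");
        index.addDocument("about.aspx", "About the Special Olympics World Games");
        // A page added later only extends the existing posting lists; nothing is rebuilt.
        index.addDocument("contact.html", "Contact the site team");
        System.out.println(index.search("special olympics"));
        System.out.println(index.search("contact"));
    }
}
```

Because addDocument only appends to the posting lists of the new document's terms, documents can be added one at a time without rebuilding the index, which is the basic idea behind the incremental indexing mentioned in the abstract. A real Lucene index additionally stores term frequencies and positions and persists segments on disk, none of which this sketch attempts.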