A Preprocessing Process for Web-Based Information Systems

57 pages, 29,240 characters in total

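Abstract (Chinese version)
With the rapid growth of the Web, the information available on it has become increasingly rich. Because the Web is convenient and rich in information, people increasingly use it to find what they need. To make better use of Web information, people continue to pursue techniques and systems that can organize and exploit it effectively. However, information on the Web suffers from several problems: pages contain a great deal of noise content, the Web holds a large number of near-replica pages, and necessary metadata is lacking. These problems seriously degrade the service quality of Web information systems.
To address the common requirements of Web information systems, this thesis proposes a preprocessing framework and corresponding methods. The framework comprises three preprocessing tasks: Web page cleaning, near-replica removal, and Web page metadata extraction. Through preprocessing, near-replica pages in the raw page collection are removed, and the retained pages are cleaned and converted into a unified structured model (called the DocView model). The model provides the metadata and content data most in demand across domains: page identifier, page type, content category, title, keywords, abstract, main text, and related links. One important advantage of the proposed method is that it requires no information beyond the raw pages, whereas other methods in this field depend on such additional information. Another advantage is that the common requirements of Web information systems are extracted in a single pass, which avoids repeating the same intermediate steps and thereby improves extraction efficiency.
The proposed framework and methods have been applied to the "Tianwang" (天网) search engine and to an automatic Web page classification system. The improvement in the quality of these applications after preprocessing verifies the effectiveness of the method. Such a preprocessing stage makes it possible to build a well-organized, cleaned, easier-to-use information layer on top of any page collection (including the World Wide Web).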

Abstract
With the rapid expansion of the Web, its content has become richer and richer. People increasingly use the Web to find the information they want because of its convenience and its abundance of information. To make better use of Web information, technologies that automatically reorganize and manipulate Web pages are being pursued, such as Web information retrieval, Web page classification, and other Web mining work. However, the Web contains a great deal of noise, such as noise content within a Web page (local noise) and near-replica Web pages across the Web (global noise). This noise lowers the quality of the information on the Web and consequently seriously degrades the quality of Web information systems. In addition, metadata of Web pages is widely used in Web information systems but is not described explicitly. Some of these problems were never encountered in traditional work.
In this thesis, we propose a new preprocessing framework and a corresponding approach to meet the common requirements of several typical Web information systems. The framework includes three parts: Web page cleaning, replica removal, and metadata extraction. After the preprocessing stage, redundant Web pages are deleted, and the retained Web pages are purified and transformed into a general model called DocView. The model consists of eight elements: identifier, type, content classification code, title, keywords, abstract, topic content, and relevant hyperlinks. Most of them are metadata, while the last two are content data. The main advantage of our approach is that it needs no information beyond the raw page, whereas previous related work usually requires additional information.
The preprocessing framework and approach have been applied to our search engine [TW] and Web page classification system. The strong evidence of improvement in these applications shows the practicability of the framework and verifies the validity of the approach. After such a preprocessing stage, a well-formed, purified, easily manipulated information layer can be set up on top of any Web page collection (including the WWW) for Web information systems.

Keywords: World Wide Web, Data preprocessing, Data cleaning, Near replica detection, Meta data extraction
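
The DocView model named in the abstract can be pictured as a simple record with eight fields. The sketch below is only an illustration under stated assumptions: the field names, types, and sample values are hypothetical, since the abstract lists the eight elements but does not define their representation. It follows the abstract in treating the last two elements (topic content and relevant hyperlinks) as content data and the other six as metadata.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List

# Assumption: the abstract says the last two of the eight elements are
# content data; the names below are illustrative, not the thesis's own.
CONTENT_FIELDS = ("topic_content", "relevant_hyperlinks")


@dataclass
class DocView:
    """Hypothetical record for one preprocessed Web page."""
    identifier: str                      # page identifier
    page_type: str                       # page type
    content_class_code: str              # content classification code
    title: str
    keywords: List[str] = field(default_factory=list)
    abstract: str = ""
    topic_content: str = ""              # cleaned main text
    relevant_hyperlinks: List[str] = field(default_factory=list)

    def content_data(self) -> Dict[str, object]:
        """Return only the two content-data elements."""
        values = asdict(self)
        return {name: values[name] for name in CONTENT_FIELDS}

    def metadata(self) -> Dict[str, object]:
        """Return the six remaining elements, treated as metadata."""
        values = asdict(self)
        return {k: v for k, v in values.items() if k not in CONTENT_FIELDS}


# Sample values are made up for illustration.
doc = DocView(
    identifier="page-0001",
    page_type="topic",
    content_class_code="CS",
    title="Example page",
    keywords=["web", "preprocessing"],
    abstract="A short summary.",
    topic_content="Main text of the page.",
    relevant_hyperlinks=["http://example.com/related"],
)
```

Keeping metadata and content data in one record reflects the abstract's point that every downstream system (search engine, classifier) can read what it needs from a single preprocessed representation.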

Contents

Chapter 1 Introduction
1.1 Research Background
1.2 Research Content
1.3 Contributions
1.4 Organization
Chapter 2 Related Work
2.1 Search Engines
2.2 Automatic Web Page Classification
2.3 Information Extraction
2.4 Metadata Extraction
Chapter 3 Problems and Common Requirements of Web Information Systems
Chapter 4 Preprocessing Methods and Techniques
4.1 Preprocessing Framework and Result Description
4.1.1 Preprocessing Framework
4.1.2 Description of Preprocessing Results
4.2 Web Page Representation
4.2.1 Tag Tree Representation of Web Pages
4.2.2 Quantitative Representation of Web Pages
4.3 Web Page Cleaning
4.3.1 Web Page Type Determination
4.3.2 Cleaning Topic Pages
4.3.3 Cleaning Directory Pages
4.3.4 Cleaning Image Pages
4.3.5 Time and Space Efficiency Analysis of Web Page Cleaning
4.4 Near-Replica Page Detection
4.4.1 Near-Replica Page Detection Algorithm
4.4.2 Performance Analysis
4.5 Web Page Metadata Extraction
4.5.1 Description of the Metadata Extraction Workflow
4.5.2 Main Text Extraction
4.5.3 Keyword Extraction
4.5.4 Content Category Determination
4.5.5 Title Extraction
4.5.6 Abstract Extraction
4.5.7 Topic-Related Hyperlink Extraction
4.6 Chapter Summary
Chapter 5 Applications and Evaluation
5.1 Application and Evaluation of Web Page Cleaning in an Automatic Web Page Classification System
5.1.1 Application
5.1.2 Evaluation Criteria
5.1.3 Results and Analysis
5.2 Application and Evaluation of Near-Replica Page Elimination in a Search Engine
5.2.1 Experimental Design
5.2.2 Evaluation Criteria
5.2.3 Results and Analysis
5.3 Application and Evaluation of Web Page Metadata in Search Engine Indexing
5.3.1 Retrieval Efficiency Evaluation
5.3.2 Retrieval Precision Evaluation
5.4 Chapter Summary
Chapter 6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
References

References
[ACMP] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. Searching the Web. ACM Transactions on Internet Technology, 2001
[APE] Allison Woodruff, Paul M. Aoki, Eric Brewer, Paul Gauthier, and Lawrence A. Rowe. An Investigation of Documents from the World Wide Web. In Proceedings of the 5th International World Wide Web Conference, pages 963--979, Paris, France, May 1996.

[Fabrizio] Fabrizio Sebastiani. A tutorial on automated text categorization.
[FSC] Feng Shicong (馮是聰). Research on Automatic Chinese Web Page Classification and Its Application in Search Engines. PhD dissertation, Peking University.
[Google] Google Inc. http://www.google.com .
[HCB] D. Hawking, N. Craswell, P. Bailey, and K. Griffiths. Measuring search engine quality. Information Retrieval, 4(1):33-59, 2001.
[HCBG] D. Hawking, N. Craswell, P. Bailey, and K. Griffiths. Measuring search engine quality. Information Retrieval, 4(1):33-59, 2001.
[HD98] C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8):521-538, 1998.
[HITS] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[HMC] J. Hammer, H. Garcia-Molina, J. Cho, A. Crespo, and R. Aranha. Extracting semistructured information from the web. In Proceedings of the Workshop on Management of Semistructured Data, pages 18-25, May 1997.
[JW] Cowie, Jim and Lehnert, Wendy. Information Extraction. Communications of the ACM, January 1996/Vol. 39, No. 1, pp 80 – 91.
[LD] Lewis D et al. Training algorithms for linear text classifiers. In Proceedings of the Nineteenth International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp.298-306
[LG98] Steve Lawrence and C. Lee Giles. Searching the World Wide Web. Science, 280(5360):98-100, Apr. 1998.
[LH02] S.-H. Lin and J.-M. Ho. Discovering informative content blocks from web documents. SIGKDD, 2002.
[LS] L. Xiaoli and S. Zhongzhi. Innovating web page classification through reducing noise. Journal of Computer Science and Technology, 17(1), January 2002.
[Manber94] U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 1-10, San Francisco, CA, USA, 1994.
[PR] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.
[Ralph97] Grishman, Ralph. Information Extraction: Techniques and Challenges. Lecture Notes In Artificial Intelligence, Vol. 1299, pp 10 – 27, Springer-Verlag, Berlin Heidelberg, 1997. ISBN 3-540-63438-X
[SB] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
[SCAM] N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries, 1995.
[SM99] N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. In WEBDB: International Workshop on the World Wide Web and Databases, WebDB. LNCS, 1999.
