Design and Implementation of a Topic-Specific Web Crawler for E-Commerce Websites.rar
This document is a compressed archive. [File list: two images]
Introduction
Original document published by member 心底的愛
Design and Implementation of a Topic-Specific Web Crawler for E-Commerce Websites
About 14,000 characters, 28 pages
Includes the thesis proposal and the assignment brief
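Abstract
A web crawler is a program that downloads web pages automatically and is an important component of a search engine. A traditional crawler starts from the URLs of one or more seed pages, collects the URLs found on those pages, and, as it fetches pages, keeps extracting new URLs from the current page and adding them to a queue until the URL queue is empty.
The topic-specific crawler designed in this thesis searches only e-commerce websites, so that users can find as much information as possible about the products they care about. Its workflow is quite complex: it analyzes each page to filter out links unrelated to e-commerce product information, keeps the useful links, and places them in the queue of URLs waiting to be fetched. It then selects the next URL to fetch from the queue according to a search strategy and repeats the process until the URL queue is empty. In addition, every page the crawler fetches is stored by the system.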
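The workflow described above (fetch a page, extract its links, filter out irrelevant ones, queue the rest, repeat until the queue is empty) can be sketched roughly as follows. This is a minimal illustration, not the thesis's own implementation: the function names, the keyword-based `is_relevant` filter, the `max_pages` limit, and the `shop.example` pages are all assumptions made for the example, and the caller supplies `fetch` so the loop can be tried without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def is_relevant(url):
    # Hypothetical filter: keep only URLs that look like product or category pages.
    return any(key in url for key in ("/product", "/item", "/category"))


def crawl(seeds, fetch, max_pages=100):
    """Crawl from the seed URLs until the queue is empty or max_pages is reached.

    fetch(url) must return the page HTML as a string, or None on failure.
    Returns a dict mapping each fetched URL to its HTML.
    """
    queue = deque(seeds)
    seen = set(seeds)
    stored = {}
    while queue and len(stored) < max_pages:
        url = queue.popleft()  # FIFO order gives a breadth-first search strategy
        html = fetch(url)
        if html is None:
            continue
        stored[url] = html  # every fetched page is kept
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link = urldefrag(urljoin(url, href)).url
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return stored


# Tiny in-memory "site" (made up for illustration) standing in for real HTTP fetches.
SITE = {
    "http://shop.example/": '<a href="/product/1">A</a> <a href="/about">About</a>',
    "http://shop.example/product/1": '<a href="/product/2">B</a> <a href="/">Home</a>',
    "http://shop.example/product/2": '<a href="/product/1">A</a>',
}
pages = crawl(["http://shop.example/"], SITE.get)
```

Popping from the left of the deque yields breadth-first order; a different search strategy would only change how the next URL is chosen from the queue.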
Building on an analysis of how web crawlers work, and making use of multithreading, the thesis designs this crawler program.
Keywords: search engine, web crawler, e-commerce
Design and Implementation of a Topic-Specific Web Crawler for E-Commerce Websites
Abstract
A web crawler is a program that automatically downloads web pages from the World Wide Web for a search engine, and it is an important component of that search engine. A traditional web crawler starts from one or more initial URLs and extracts new URLs from the pages it retrieves. As it continues downloading HTML pages, it keeps finding new URLs and deciding which of them to add to a queue, and it runs until the URL queue is empty.
The web crawler designed in this thesis collects information from e-commerce websites, so that users can find as much as possible of the information they care about.
A crawler that targets e-commerce websites has a quite complicated workflow. It must analyze each page, filter out links that are unrelated to e-commerce product information, keep the useful links, and place them in the URL queue. Then, following a given search strategy, it chooses the next URL from the queue, downloads that page, and repeats the process until the URL queue is empty. In addition, all downloaded pages are stored on the local drive.
Based on an analysis of the working principle of web crawlers and on multithreading technology, this thesis designs the crawler program.
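As a rough illustration of how multithreading can be combined with a shared URL queue, the sketch below runs several worker threads that take URLs from one thread-safe queue. It is only an assumed design, not the program from the thesis: `crawl_concurrently`, `fetch`, `extract_links`, and `num_workers` are names invented for this example, and the caller supplies both the page fetcher and the link extractor (which is where any URL filtering would go).

```python
import queue
import threading


def crawl_concurrently(seeds, fetch, extract_links, num_workers=4):
    """Fetch pages with several worker threads sharing one URL queue.

    fetch(url) returns the page text, or None on failure.
    extract_links(url, text) returns an iterable of absolute URLs to follow.
    Returns a dict mapping each fetched URL to its text.
    """
    todo = queue.Queue()
    seen = set(seeds)
    stored = {}
    lock = threading.Lock()  # guards seen and stored

    def worker():
        while True:
            url = todo.get()
            if url is None:  # sentinel: no more work for this thread
                todo.task_done()
                return
            try:
                text = fetch(url)
                if text is not None:
                    links = list(extract_links(url, text))
                    new = []
                    with lock:
                        stored[url] = text
                        for link in links:
                            if link not in seen:
                                seen.add(link)
                                new.append(link)
                    # Queue new URLs before marking this one done, so the
                    # pending-task count cannot reach zero too early.
                    for link in new:
                        todo.put(link)
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in seeds:
        todo.put(url)
    todo.join()  # returns once every queued URL has been processed
    for _ in threads:
        todo.put(None)
    for t in threads:
        t.join()
    return stored
```

The lock keeps the check-and-add on the `seen` set atomic, so two threads cannot both queue the same URL. `queue.Queue` itself is thread-safe, which is why the `put` calls sit outside the lock.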
Keywords: search engine, web crawler, e-commerce
Contents
Abstract (Chinese)
Abstract (English)
Contents
1 Introduction
1.1 Background and significance of the topic
1.2 State of research in China and abroad
1.3 Applications of crawlers in e-commerce
1.4 Work to be completed in this thesis
2 Web crawlers
2.1 Overview of search engines
2.1.1 Overview of general-purpose search engines
2.1.2 Introduction to specialized search engines
2.1.3 Performance metrics of search engines
2.2 Overview of web crawlers
2.2.1 Introduction to web crawlers
2.2.2 How web crawlers work
3 Design of the topic-specific web crawler
3.1 Crawler design principles
3.2 Use of threading
3.2.1 Creating threads
3.2.2 Inter-thread communication
3.3 Analysis of the crawler structure
3.3.1 How to parse HTML
3.3.2 Structure of the Spider program
3.3.3 Building the Spider program
3.3.4 URL filtering strategy
3.4 Analysis of run results
Conclusion
Acknowledgments
References