
A crawler tool that scrapes website data through JSON configuration

Running the crawler requires Docker to be installed on your local machine, because the tool is packaged as a Docker image.

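1. Download and run the docker container, mapping a local directory into it. In the -v argument, the part before the colon is your local path and the part after it (/data) is the path inside the container; the crawled data is written to that local path. The command below maps no port; if you add a port mapping, any host port that does not conflict with one already in use locally will do.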

docker run --name spidertool -v your_local_path:/data -d windboy/spider_tool_common
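A bare name on the host side of -v is interpreted by Docker as a named volume rather than a directory, so it is safest to pass an absolute path. As a minimal sketch (the helper name build_run_command and the directory spider_output are hypothetical, not part of the tool), the command can be assembled from Python with the path resolved first:

```python
import shlex
from pathlib import Path


def build_run_command(local_dir, container_name="spidertool",
                      image="windboy/spider_tool_common"):
    """Build the docker run command from step 1 with an absolute host path."""
    # Resolve to an absolute path so Docker treats it as a bind mount.
    host_path = Path(local_dir).expanduser().resolve()
    return ["docker", "run", "--name", container_name,
            "-v", f"{host_path}:/data", "-d", image]


# "spider_output" is a hypothetical directory name used only for illustration.
command = build_run_command("./spider_output")
print(shlex.join(command))
```

The printed line can be pasted into a terminal, or the list can be passed to subprocess.run as-is.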

2. Enter the docker container

docker exec -it spidertool bash

3. Run python3, then enter the following at the prompt

from spider_tool_common import spider_tool_common as spidertool

params_json = {
    "spider_name": "zj_jdggzy_bidding",
    "loop_num": "10",
    "start_index": "1",
    "multi_factor": "25",
    "pagelist_get_index": "",
    "pagelist_groups_resolving": "//table[@class='GridView']//tr[@class='Row']",
    "pagelist_url_resolving": ".//a/@href",
    "detailpage_fields_prvnce_name": "浙江省",
    "detailpage_fields_latn_name": "杭州市",
    "detailpage_fields_country_name": "建德市",
    "detailpage_fields_inter_name": "杭州市公共资源交易中心建德分中心",
    "detailpage_fields_table_names": "dict_winbidder_test_01",
    "detailpage_fields_inter_type": "2",
}
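The two pagelist_* values are XPath expressions: judging by their names, the first appears to select one row per list entry and the second to pull the detail-page link out of each row. The tool's internals are not shown here, so the following is only a sketch of how such expressions behave, run against a made-up page with Python's standard library. SAMPLE_PAGE and extract_detail_links are hypothetical names, and the real site's markup may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified list page; not the real site's markup.
SAMPLE_PAGE = """
<html><body>
<table class="GridView">
  <tr class="Header"><th>Title</th></tr>
  <tr class="Row"><td><a href="/detail/1.html">Notice 1</a></td></tr>
  <tr class="Row"><td><a href="/detail/2.html">Notice 2</a></td></tr>
</table>
</body></html>
"""


def extract_detail_links(page_source):
    """Return the href of every link found inside a matching table row."""
    root = ET.fromstring(page_source)
    # Same expression as pagelist_groups_resolving, with a leading "."
    # because ElementTree only accepts relative paths on an element.
    rows = root.findall(".//table[@class='GridView']//tr[@class='Row']")
    links = []
    for row in rows:
        # ElementTree cannot evaluate ".//a/@href" directly, so select the
        # <a> elements and read the attribute instead.
        for anchor in row.findall(".//a"):
            href = anchor.get("href")
            if href:
                links.append(href)
    return links


detail_links = extract_detail_links(SAMPLE_PAGE)
print(detail_links)
```

Only the rows with class Row contribute links; the header row is skipped. A full XPath engine such as lxml would accept both expressions unchanged.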

spidertool.insert_params(params_json)  # write the XPath parameters

spidertool.crawl_data()  # start crawling
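Since the configuration is plain JSON, it can also be kept in a file instead of being typed at the prompt. The sketch below is one possible way to do that; load_params and the file name in the comments are hypothetical, and the all-strings check is only an assumption based on the example above, where every value is a string. Note that a JSON file, unlike the Python dict, must not have a trailing comma after the last entry.

```python
import json


def load_params(config_path):
    """Load crawler parameters from a JSON file and check their basic shape."""
    with open(config_path, encoding="utf-8") as handle:
        params = json.load(handle)
    if not isinstance(params, dict):
        raise ValueError("the JSON file must contain a single object")
    # Assumption: every value is a string, as in the example above.
    non_strings = [key for key, value in params.items()
                   if not isinstance(value, str)]
    if non_strings:
        raise ValueError("non-string values for: " + ", ".join(non_strings))
    return params


# Inside the container this could be used as follows (file name is hypothetical):
# params_json = load_params("/data/zj_jdggzy_bidding.json")
# spidertool.insert_params(params_json)
# spidertool.crawl_data()
```

Placing the file in the mapped local directory makes it visible inside the container under /data.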

4. Get the crawl results

Go to the local path you specified in step 1; the crawled data is there, in files ending in .csv.
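To check the output without opening each file by hand, the first few rows can be printed with the standard csv module. This is a sketch under assumptions: the file names and columns the tool produces are not documented here, the files are assumed to be UTF-8 encoded, and preview_csv_files is a hypothetical helper.

```python
import csv
from pathlib import Path


def preview_csv_files(output_dir, max_rows=5):
    """Return {file name: first rows} for each .csv file directly under output_dir."""
    previews = {}
    for csv_path in sorted(Path(output_dir).glob("*.csv")):
        # Assumption: UTF-8 files; change the encoding if the text looks garbled.
        with csv_path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            previews[csv_path.name] = [row for _, row in zip(range(max_rows), reader)]
    return previews


if __name__ == "__main__":
    # Replace "your_local_path" with the directory mapped in step 1.
    for name, rows in preview_csv_files("your_local_path").items():
        print(name)
        for row in rows:
            print("  ", row)
```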


