This tutorial is for learning and sharing only. Do not use it for illegal purposes!

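1. Introduction

Aiqicha (爱企查) is Baidu's company information lookup tool for quickly querying information about a company. Testers often need to collect large amounts of company information during routine testing, and crawling the Aiqicha website is one way to do it.

To qualify as a generic-type CNVD vulnerability, the affected company generally needs registered capital of at least 50 million RMB (5000万). This article demonstrates how to crawl information on companies that meet that threshold.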

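2. Analysis

1. Log in to Aiqicha and filter companies by:

  • Registered capital of 50 million (5000万) or more
  • Capital type: RMB
  • Company status: open (开业)

2. Once the results load, try to find a company name shown on the page, 广州市荔湾区华强电动设备行越秀分行, in the page source

image.png

The search finds nothing, so the list is not part of the static HTML:
image.png

Open the developer tools (F12) and look at the network requests instead.
There are too many requests to read through one by one.

3. Focus on crawling the company name and registered capital

Click a page number in the pagination bar at the bottom:
image.png

Inspect the payload of the captured request
image.png

Visit a few more pages
image.png
The payloads are nearly identical:
p: the page number
f: the filter conditions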
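To make the meaning of p and f concrete, the captured request URL can be split into its query parameters. The following is a minimal sketch using only the standard library; the helper name `decode_search_url` is made up for illustration, and the URL is the one captured in the browser.

```python
import json
from urllib.parse import parse_qs, urlsplit

# Request URL as captured in the browser's network panel
CAPTURED_URL = (
    "https://aiqicha.baidu.com/s/advanceSearchAjax?p=2&s=10"
    "&f=%7B%22regCap%22:[%7B%22start%22:5000,%22end%22:0%7D],"
    "%22regCapType%22:[%221%22],"
    "%22openStatus%22:[%22%E5%BC%80%E4%B8%9A%22]%7D&o=0"
)


def decode_search_url(url):
    """Return the query parameters of a search URL, with `f` parsed as JSON."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; each key appears once in this URL
    params = {key: values[0] for key, values in query.items()}
    params["f"] = json.loads(params["f"])
    return params


if __name__ == "__main__":
    decoded = decode_search_url(CAPTURED_URL)
    print(decoded["p"])  # the page number
    print(decoded["f"])  # the filter conditions as a dict
```

Once decoded, f turns out to be a small JSON object whose keys mirror the three filters chosen on the page.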

4. Inspect the response (partial)

https://aiqicha.baidu.com/s/advanceSearchAjax?p=2&s=10&f=%7B%22regCap%22:[%7B%22start%22:5000,%22end%22:0%7D],%22regCapType%22:[%221%22],%22openStatus%22:[%22%E5%BC%80%E4%B8%9A%22]%7D&o=0

{
"status": 0,
"msg": "",
"data": {
"qType": 111,
"queryStr": "",
"pageNum": 3,
"resultList": [
{
"pid": "91770835364651",
"entName": "\u6d59\u6c5f\u4e49\u94ed\u5efa\u7b51\u52b3\u52a1\u6709\u9650\u516c\u53f8",
"entType": "\u6709\u9650\u8d23\u4efb\u516c\u53f8(\u81ea\u7136\u4eba\u6295\u8d44\u6216\u63a7\u80a1)",
"validityFrom": "2018-08-22",
"domicile": "\u6d59\u6c5f\u7701\u676d\u5dde\u5e02\u94b1\u5858\u65b0\u533a\u4e07\u4e9a\u540d\u57ce2\u5e62715\u5ba4",
"entLogo": "",
"openStatus": "\u5f00\u4e1a",
"legalPerson": "\u5415\u7956\u5f3a",
"tags": [],
"logoWord": "\u4e49\u94ed\u5efa\u7b51",
"hkLable": [],
"isHkComp": 0,
"isClaim": 0,
"titleName": "\u6d59\u6c5f\u4e49\u94ed\u5efa\u7b51\u52b3\u52a1\u6709\u9650\u516c\u53f8",
"titleLegal": "\u5415\u7956\u5f3a",
"titleDomicile": "\u6d59\u6c5f\u7701\u676d\u5dde\u5e02\u94b1\u5858\u65b0\u533a\u4e07\u4e9a\u540d\u57ce2\u5e62715\u5ba4",
"regCap": "1,011,000,000.0\u4e07",
"scope": "\u8bb8\u53ef\u9879\u76ee\uff1a\u623f\u5c4b\u5efa\u7b51\u548c\u5e02\u653f\u57fa\u7840\u8bbe\u65bd\u9879\u76ee\u5de5\u7a0b\u603b\u627f\u5305\uff1b\u5404\u7c7b\u5de5\u7a0b\u5efa\u8bbe\u6d3b\u52a8\uff1b\u7535\u6c14\u5b89\u88c5\u670d\u52a1\uff1b\u5efa\u7b51\u52b3\u52a1\u5206\u5305\uff1b\u4eba\u9632\u5de5\u7a0b\u9632\u62a4\u8bbe\u5907\u5236\u9020(\u4f9d\u6cd5\u987b\u7ecf\u6279\u51c6\u7684\u9879\u76ee\uff0c\u7ecf\u76f8\u5173\u90e8\u95e8\u6279\u51c6\u540e\u65b9\u53ef\u5f00\u5c55\u7ecf\u8425\u6d3b\u52a8\uff0c\u5177\u4f53\u7ecf\u8425\u9879\u76ee\u4ee5\u5ba1\u6279\u7ed3\u679c\u4e3a\u51c6)\u3002\u4e00\u822c\u9879\u76ee\uff1a\u5de5\u7a0b\u7ba1\u7406\u670d\u52a1\uff1b\u4f4f\u5b85\u6c34\u7535\u5b89\u88c5\u7ef4\u62a4\u670d\u52a1\uff1b\u516c\u8def\u6c34\u8fd0\u5de5\u7a0b\u8bd5\u9a8c\u68c0\u6d4b\u670d\u52a1\uff1b\u8f68\u9053\u4ea4\u901a\u4e13\u7528\u8bbe\u5907\u3001\u5173\u952e\u7cfb\u7edf\u53ca\u90e8\u4ef6\u9500\u552e\uff1b\u627f\u63a5\u603b\u516c\u53f8\u5de5\u7a0b\u5efa\u8bbe\u4e1a\u52a1\uff1b\u571f\u77f3\u65b9\u5de5\u7a0b\u65bd\u5de5\uff1b\u56ed\u6797\u7eff\u5316\u5de5\u7a0b\u65bd\u5de5\uff1b\u57ce\u5e02\u7eff\u5316\u7ba1\u7406\uff1b\u82b1\u5349\u7eff\u690d\u79df\u501f\u4e0e\u4ee3\u7ba1\u7406\uff1b\u9632\u8150\u6750\u6599\u9500\u552e\uff1b\u4e94\u91d1\u4ea7\u54c1\u96f6\u552e\uff1b\u4e94\u91d1\u4ea7\u54c1\u6279\u53d1\uff1b\u91d1\u5c5e\u6750\u6599\u9500\u552e\uff1b\u5efa\u7b51\u6750\u6599\u9500\u552e\uff1b\u6d82\u6599\u5236\u9020\uff08\u4e0d\u542b\u5371\u9669\u5316\u5b66\u54c1\uff09\uff1b\u91d1\u5c5e\u7ed3\u6784\u9500\u552e\uff1b\u5efa\u7b51\u7528\u94a2\u7b4b\u4ea7\u54c1\u9500\u552e\uff1b\u91d1\u5c5e\u7ed3\u6784\u5236\u9020(\u9664\u4f9d\u6cd5\u987b\u7ecf\u6279\u51c6\u7684\u9879\u76ee\u5916\uff0c\u51ed\u8425\u4e1a\u6267\u7167\u4f9d\u6cd5\u81ea\u4e3b\u5f00\u5c55\u7ecf\u8425\u6d3b\u52a8)\u3002",
"regNo": "91320322MA1X36PF3D",
"appJumpUrl": "aiqicha:\/\/open.app?params={\"naModule\":\"\/aqc\/detail\",\"naParam\":\"{\\\"pid\\\":\\\"91770835364651\\\"}\"}",
"labels": {
"opening": {
"text": "\u5f00\u4e1a",
"style": "blue",
"fontColor": "#1EA830",
"bgColor": "#EBF7EC"
}
},
"personTitle": "\u6cd5\u5b9a\u4ee3\u8868\u4eba",
"personId": "0fa179597fa8aaccf6f6ae44062f8a24",
"newLabels": [
{
"key": "opening",
"value": {
"text": "\u5f00\u4e1a",
"style": "blue",
"fontColor": "#1EA830",
"bgColor": "#EBF7EC"
}
}
]
},

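Pick a couple of fields, for example:
"entName": "\u946b\u6e90\u9e3f\u660a(\u5929\u6d25)\u79d1\u6280\u6709\u9650\u516c\u53f8",
"regCap": "1,011,000,000.0\u4e07",

These values are Unicode escape sequences, not encrypted data. Decode them:
image.png

The decoded values match the information on the page that we want to crawl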
image.png
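The decoding step can also be done in Python rather than with an online tool. This is a small sketch, assuming the two fields are copied from the response shown earlier; `json.loads` converts each `\uXXXX` escape to the character it stands for.

```python
import json

# Two fields copied from the response, still in escaped form
raw = r'{"entName": "\u6d59\u6c5f\u4e49\u94ed\u5efa\u7b51\u52b3\u52a1\u6709\u9650\u516c\u53f8", "regCap": "1,011,000,000.0\u4e07"}'

# json.loads turns every \uXXXX escape into the character it represents
record = json.loads(raw)

if __name__ == "__main__":
    print(record["entName"])
    print(record["regCap"])
```

In the crawler itself no extra step is needed, because `response.json()` in requests performs the same decoding.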

3. Writing the program

1. Save the request headers:

Accept: application/json, text/plain, */*
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
Connection: keep-alive
Cookie: <captured cookie>
Host: aiqicha.baidu.com
Referer: https://aiqicha.baidu.com/advancesearch/list
Sec-Ch-Ua: "Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"
Sec-Ch-Ua-Mobile: ?0
Sec-Ch-Ua-Platform: "Windows"
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36
X-Requested-With: XMLHttpRequest
Ymg_ssr: 1686132225976_1686209020456_Krd2ws59qA1TALmBzJb0wdzDayRK14xrDbs2NsxGZghPtW2AaG8K4Now6K/zTfBecE9i1aegJL5KTT+cxrK6j9JK2ix9MoT8+qWljBqcj0i1PEU+RTVvgGjdlkPqamJydND6fQSkBidmHrZVhBXKSiqByC1knHzxmFJl0d+FYKOIK5Yue9P/KBLU2Q3FIF1YsrkslH8qYriHAg5MNXzRTlrU4gdLir4fXP++E+JwOUR2jkA8Mv4vTuUOG7crcKk2W+omKed78e4P6vi6j3qgdk6Pc4jel6aItYE8iTIAZCAcJNP4raFja3p6Eb/KAqre
Zx-Open-Url: https://aiqicha.baidu.com/advancesearch/list

Before writing the crawler, build a headers dict from these request headers

headers = {
    "Accept": "application/json, text/plain, */*",
    # "br" is only decoded by requests when the brotli package is installed
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": 'zh-CN,zh;q=0.9',
    "Connection": "keep-alive",
    # Paste the cookie captured in the browser here
    "Cookie": '<captured cookie>',
    'Host': 'aiqicha.baidu.com',
    'Referer': 'https://aiqicha.baidu.com/advancesearch/list',
    'Sec-Ch-Ua': '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Ymg_ssr': '1686132225976_1686209020456_Krd2ws59qA1TALmBzJb0wdzDayRK14xrDbs2NsxGZghPtW2AaG8K4Now6K/zTfBecE9i1aegJL5KTT+cxrK6j9JK2ix9MoT8+qWljBqcj0i1PEU+RTVvgGjdlkPqamJydND6fQSkBidmHrZVhBXKSiqByC1knHzxmFJl0d+FYKOIK5Yue9P/KBLU2Q3FIF1YsrkslH8qYriHAg5MNXzRTlrU4gdLir4fXP++E+JwOUR2jkA8Mv4vTuUOG7crcKk2W+omKed78e4P6vi6j3qgdk6Pc4jel6aItYE8iTIAZCAcJNP4raFja3p6Eb/KAqre',
    'Zx-Open-Url': 'https://aiqicha.baidu.com/advancesearch/list'
}
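Typing the dict by hand is error-prone, so the copied header text can be converted instead. The sketch below is one possible helper, not part of the original article: `parse_devtools_headers` is a hypothetical name, and it assumes the text comes either as a name line ending in a colon followed by a value line, or as single `Name: value` lines.

```python
def parse_devtools_headers(raw):
    """Turn header text copied from the browser's network panel into a dict.

    Handles two layouts: a line ending in ':' that holds the header name
    with its value on the next non-empty line, and 'Name: value' on one line.
    """
    headers = {}
    pending_name = None
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        if pending_name is not None:
            # This line is the value for the name seen on the previous line
            headers[pending_name] = line
            pending_name = None
        elif line.endswith(":"):
            pending_name = line[:-1]
        else:
            name, _, value = line.partition(": ")
            headers[name] = value
    return headers


SAMPLE = """Accept:
application/json, text/plain, */*
Host:
aiqicha.baidu.com
Referer:
https://aiqicha.baidu.com/advancesearch/list
Cookie: <captured cookie>
"""

if __name__ == "__main__":
    for name, value in parse_devtools_headers(SAMPLE).items():
        print(f"{name!r}: {value!r},")
```

The resulting dict can be passed directly as the `headers` argument of `requests.get`.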

2. Construct the params:

params = {
    "p": "1",
    "s": "10",
    'f': '{"regCap":[{"start":5000,"end":0}],"regCapType":["1"],"openStatus":["开业"]}',
    "o": "0",
}
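The f value is a JSON string, so it can be generated from a dict instead of being written by hand. This is a sketch under assumptions: `build_params` and its argument names are hypothetical, `s` is taken to be the page size, `start` is taken to be in units of 万, and `"end": 0` is taken to mean no upper bound. None of these meanings is confirmed by the captured traffic.

```python
import json


def build_params(page, min_capital_wan=5000, page_size=10):
    """Build the query parameters for one page of the advanced search."""
    filters = {
        # Assumed: start/end bound the registered capital in units of 万
        "regCap": [{"start": min_capital_wan, "end": 0}],
        "regCapType": ["1"],
        "openStatus": ["开业"],
    }
    return {
        "p": str(page),
        "s": str(page_size),
        # Compact separators and unescaped non-ASCII match the captured string
        "f": json.dumps(filters, ensure_ascii=False, separators=(",", ":")),
        "o": "0",
    }


if __name__ == "__main__":
    print(build_params(1))
```

Building the string with `json.dumps` avoids quoting mistakes when the filters change.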

3. Crawler implementation

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests


def main():
    # Query parameters: page 1, with the same filters as in the browser
    params = {
        "p": "1",
        "s": "10",
        'f': '{"regCap":[{"start":5000,"end":0}],"regCapType":["1"],"openStatus":["开业"]}',
        "o": "0",
    }
    headers = {
        "Accept": "application/json, text/plain, */*",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": 'zh-CN,zh;q=0.9',
        "Connection": "keep-alive",
        "Cookie": '<captured cookie>',
        'Host': 'aiqicha.baidu.com',
        'Referer': 'https://aiqicha.baidu.com/advancesearch/list',
        'Sec-Ch-Ua': '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
        'Sec-Ch-Ua-Mobile': '?0',
        'Sec-Ch-Ua-Platform': '"Windows"',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Ymg_ssr': '1686132225976_1686209020456_Krd2ws59qA1TALmBzJb0wdzDayRK14xrDbs2NsxGZghPtW2AaG8K4Now6K/zTfBecE9i1aegJL5KTT+cxrK6j9JK2ix9MoT8+qWljBqcj0i1PEU+RTVvgGjdlkPqamJydND6fQSkBidmHrZVhBXKSiqByC1knHzxmFJl0d+FYKOIK5Yue9P/KBLU2Q3FIF1YsrkslH8qYriHAg5MNXzRTlrU4gdLir4fXP++E+JwOUR2jkA8Mv4vTuUOG7crcKk2W+omKed78e4P6vi6j3qgdk6Pc4jel6aItYE8iTIAZCAcJNP4raFja3p6Eb/KAqre',
        'Zx-Open-Url': 'https://aiqicha.baidu.com/advancesearch/list'
    }
    url = "https://aiqicha.baidu.com/s/advanceSearchAjax"
    response = requests.get(url, headers=headers, params=params, timeout=10)
    # Each entry in resultList describes one company
    result_list = response.json()["data"]["resultList"]
    for item in result_list:
        print(item)
        print(item['entName'])


if __name__ == '__main__':
    main()

4. The crawl succeeds:
image.png

The output matches what the browser shows
image.png
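The script indexes straight into `["data"]["resultList"]`, which raises an exception as soon as the site returns something else, for example when the cookie has expired. A more defensive extraction might look like the sketch below. The helper name `extract_result_list` is hypothetical, and treating a non-zero `status` as failure is an assumption based only on the sample response, where `status` is 0.

```python
def extract_result_list(payload):
    """Return the list of companies from a decoded response, or [] if absent."""
    # Assumed: status 0 means success, as in the sample response
    if not isinstance(payload, dict) or payload.get("status") != 0:
        return []
    data = payload.get("data")
    if not isinstance(data, dict):
        return []
    result_list = data.get("resultList")
    return result_list if isinstance(result_list, list) else []


if __name__ == "__main__":
    sample = {"status": 0, "msg": "", "data": {"resultList": [{"entName": "A"}]}}
    print(extract_result_list(sample))
```

In the crawler this would be called as `extract_result_list(response.json())`.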

5. Save the crawled data to a file

import requests


def main():
    params = {
        "p": "1",
        "s": "10",
        'f': '{"regCap":[{"start":5000,"end":0}],"regCapType":["1"],"openStatus":["开业"]}',
        "o": "0",
    }
    headers = {
        "Accept": "application/json, text/plain, */*",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": 'zh-CN,zh;q=0.9',
        "Connection": "keep-alive",
        "Cookie": '<captured cookie>',
        'Host': 'aiqicha.baidu.com',
        'Referer': 'https://aiqicha.baidu.com/advancesearch/list',
        'Sec-Ch-Ua': '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
        'Sec-Ch-Ua-Mobile': '?0',
        'Sec-Ch-Ua-Platform': '"Windows"',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Ymg_ssr': '1686132225976_1686209020456_Krd2ws59qA1TALmBzJb0wdzDayRK14xrDbs2NsxGZghPtW2AaG8K4Now6K/zTfBecE9i1aegJL5KTT+cxrK6j9JK2ix9MoT8+qWljBqcj0i1PEU+RTVvgGjdlkPqamJydND6fQSkBidmHrZVhBXKSiqByC1knHzxmFJl0d+FYKOIK5Yue9P/KBLU2Q3FIF1YsrkslH8qYriHAg5MNXzRTlrU4gdLir4fXP++E+JwOUR2jkA8Mv4vTuUOG7crcKk2W+omKed78e4P6vi6j3qgdk6Pc4jel6aItYE8iTIAZCAcJNP4raFja3p6Eb/KAqre',
        'Zx-Open-Url': 'https://aiqicha.baidu.com/advancesearch/list'
    }
    url = "https://aiqicha.baidu.com/s/advanceSearchAjax"
    response = requests.get(url, headers=headers, params=params, timeout=10)
    result = response.json()["data"]["resultList"]

    # Append one line per company; UTF-8 keeps the Chinese text intact
    with open("./a.txt", "a", encoding="utf-8") as f:
        for item in result:
            line = (
                f"Company name: {item['entName']} "
                f"Registered capital: {item['regCap']} "
                f"Company type: {item['entType']} "
                f"Established: {item['validityFrom']} "
                f"Address: {item['domicile']} "
                f"Status: {item['openStatus']}\n"
            )
            f.write(line)


if __name__ == '__main__':
    main()

image.png
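Space-separated text is awkward to process later, especially because `regCap` values such as `1,011,000,000.0万` contain commas. As an alternative, the same fields could be written as CSV with the standard `csv` module, which quotes such values automatically. This is a sketch only; `write_companies_csv` and `FIELDS` are illustrative names, and the field list is the one used by the script.

```python
import csv
import io

# The response fields saved by the crawler, in output order
FIELDS = ["entName", "regCap", "entType", "validityFrom", "domicile", "openStatus"]


def write_companies_csv(items, out):
    """Write one CSV row per company to the file-like object `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for item in items:
        # Missing keys become empty cells instead of raising KeyError
        writer.writerow({name: item.get(name, "") for name in FIELDS})


if __name__ == "__main__":
    buffer = io.StringIO()
    write_companies_csv([{"entName": "A", "regCap": "1,011,000,000.0万"}], buffer)
    print(buffer.getvalue())
```

To write to disk, open the file with `open("companies.csv", "w", newline="", encoding="utf-8")` and pass it as `out`; the file name here is arbitrary.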

6. Crawl multiple pages

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests


def main(n):
    # n is the page number to request
    params = {
        "p": str(n),
        "s": "10",
        'f': '{"regCap":[{"start":5000,"end":0}],"regCapType":["1"],"openStatus":["开业"]}',
        "o": "0",
    }
    headers = {
        "Accept": "application/json, text/plain, */*",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": 'zh-CN,zh;q=0.9',
        "Connection": "keep-alive",
        "Cookie": '<captured cookie>',
        'Host': 'aiqicha.baidu.com',
        'Referer': 'https://aiqicha.baidu.com/advancesearch/list',
        'Sec-Ch-Ua': '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
        'Sec-Ch-Ua-Mobile': '?0',
        'Sec-Ch-Ua-Platform': '"Windows"',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Ymg_ssr': '1686132225976_1686209020456_Krd2ws59qA1TALmBzJb0wdzDayRK14xrDbs2NsxGZghPtW2AaG8K4Now6K/zTfBecE9i1aegJL5KTT+cxrK6j9JK2ix9MoT8+qWljBqcj0i1PEU+RTVvgGjdlkPqamJydND6fQSkBidmHrZVhBXKSiqByC1knHzxmFJl0d+FYKOIK5Yue9P/KBLU2Q3FIF1YsrkslH8qYriHAg5MNXzRTlrU4gdLir4fXP++E+JwOUR2jkA8Mv4vTuUOG7crcKk2W+omKed78e4P6vi6j3qgdk6Pc4jel6aItYE8iTIAZCAcJNP4raFja3p6Eb/KAqre',
        'Zx-Open-Url': 'https://aiqicha.baidu.com/advancesearch/list'
    }
    url = "https://aiqicha.baidu.com/s/advanceSearchAjax"
    response = requests.get(url, headers=headers, params=params, timeout=10)
    result = response.json()["data"]["resultList"]

    # Append one line per company to the output file
    with open("./a.txt", "a", encoding="utf-8") as f:
        for item in result:
            line = (
                f"Company name: {item['entName']} "
                f"Registered capital: {item['regCap']} "
                f"Company type: {item['entType']} "
                f"Established: {item['validityFrom']} "
                f"Address: {item['domicile']} "
                f"Status: {item['openStatus']}\n"
            )
            f.write(line)


if __name__ == '__main__':
    main(1)
    main(2)
The crawl succeeds
image.png

Use a for loop to crawl multiple pages

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import requests


def main(n):
    # n is the page number to request
    params = {
        "p": str(n),
        "s": "10",
        'f': '{"regCap":[{"start":5000,"end":0}],"regCapType":["1"],"openStatus":["开业"]}',
        "o": "0",
    }
    headers = {
        "Accept": "application/json, text/plain, */*",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": 'zh-CN,zh;q=0.9',
        "Connection": "keep-alive",
        "Cookie": '<captured cookie>',
        'Host': 'aiqicha.baidu.com',
        'Referer': 'https://aiqicha.baidu.com/advancesearch/list',
        'Sec-Ch-Ua': '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
        'Sec-Ch-Ua-Mobile': '?0',
        'Sec-Ch-Ua-Platform': '"Windows"',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Ymg_ssr': '1686132225976_1686209020456_Krd2ws59qA1TALmBzJb0wdzDayRK14xrDbs2NsxGZghPtW2AaG8K4Now6K/zTfBecE9i1aegJL5KTT+cxrK6j9JK2ix9MoT8+qWljBqcj0i1PEU+RTVvgGjdlkPqamJydND6fQSkBidmHrZVhBXKSiqByC1knHzxmFJl0d+FYKOIK5Yue9P/KBLU2Q3FIF1YsrkslH8qYriHAg5MNXzRTlrU4gdLir4fXP++E+JwOUR2jkA8Mv4vTuUOG7crcKk2W+omKed78e4P6vi6j3qgdk6Pc4jel6aItYE8iTIAZCAcJNP4raFja3p6Eb/KAqre',
        'Zx-Open-Url': 'https://aiqicha.baidu.com/advancesearch/list'
    }
    url = "https://aiqicha.baidu.com/s/advanceSearchAjax"
    response = requests.get(url, headers=headers, params=params, timeout=10)
    result = response.json()["data"]["resultList"]

    # Append one line per company to the output file
    with open("./a.txt", "a", encoding="utf-8") as f:
        for item in result:
            line = (
                f"Company name: {item['entName']} "
                f"Registered capital: {item['regCap']} "
                f"Company type: {item['entType']} "
                f"Established: {item['validityFrom']} "
                f"Address: {item['domicile']} "
                f"Status: {item['openStatus']}\n"
            )
            f.write(line)


if __name__ == '__main__':
    # range(1, 5) crawls pages 1 through 4
    for i in range(1, 5):
        main(i)

image.png
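A fixed page range keeps requesting pages even after the results run out, and it sends requests back to back. One possible refinement, sketched below, is to stop at the first empty page and pause between requests. `crawl_pages` is a hypothetical helper; `fetch_page` stands for any function that takes a page number and returns that page's `resultList`, and the one-second default delay is an arbitrary choice.

```python
import time


def crawl_pages(fetch_page, first_page=1, max_pages=500, delay_seconds=1.0):
    """Yield companies page by page until a page comes back empty."""
    for page in range(first_page, first_page + max_pages):
        items = fetch_page(page)
        if not items:
            # An empty page is taken to mean there are no more results
            break
        yield from items
        # Pause between requests to avoid hammering the server
        time.sleep(delay_seconds)


if __name__ == "__main__":
    # Stand-in for the real request function, so the sketch runs offline
    fake_site = {1: ["company A", "company B"], 2: ["company C"]}
    for company in crawl_pages(lambda page: fake_site.get(page, []), delay_seconds=0):
        print(company)
```

Passing the fetch function in as an argument keeps the paging logic testable without network access.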

4. Multithreaded crawler

1. Plain crawler

Crawl 500 pages of Aiqicha data

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import time

import requests


def main(n):
    # n is the page number to request
    params = {
        "p": str(n),
        "s": "10",
        'f': '{"regCap":[{"start":5000,"end":0}],"regCapType":["1"],"openStatus":["开业"]}',
        "o": "0",
    }
    headers = {
        "Accept": "application/json, text/plain, */*",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": 'zh-CN,zh;q=0.9',
        "Connection": "keep-alive",
        "Cookie": '<captured cookie>',
        'Host': 'aiqicha.baidu.com',
        'Referer': 'https://aiqicha.baidu.com/advancesearch/list',
        'Sec-Ch-Ua': '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
        'Sec-Ch-Ua-Mobile': '?0',
        'Sec-Ch-Ua-Platform': '"Windows"',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Ymg_ssr': '1686132225976_1686209020456_Krd2ws59qA1TALmBzJb0wdzDayRK14xrDbs2NsxGZghPtW2AaG8K4Now6K/zTfBecE9i1aegJL5KTT+cxrK6j9JK2ix9MoT8+qWljBqcj0i1PEU+RTVvgGjdlkPqamJydND6fQSkBidmHrZVhBXKSiqByC1knHzxmFJl0d+FYKOIK5Yue9P/KBLU2Q3FIF1YsrkslH8qYriHAg5MNXzRTlrU4gdLir4fXP++E+JwOUR2jkA8Mv4vTuUOG7crcKk2W+omKed78e4P6vi6j3qgdk6Pc4jel6aItYE8iTIAZCAcJNP4raFja3p6Eb/KAqre',
        'Zx-Open-Url': 'https://aiqicha.baidu.com/advancesearch/list'
    }
    url = "https://aiqicha.baidu.com/s/advanceSearchAjax"
    response = requests.get(url, headers=headers, params=params, timeout=10)
    result = response.json()["data"]["resultList"]

    with open("./a.txt", "a", encoding="utf-8") as f:
        for item in result:
            try:
                line = (
                    f"Company name: {item['entName']} "
                    f"Registered capital: {item['regCap']} "
                    f"Company type: {item['entType']} "
                    f"Established: {item['validityFrom']} "
                    f"Address: {item['domicile']} "
                    f"Status: {item['openStatus']}\n"
                )
            except KeyError:
                # Skip records that lack one of the expected fields
                continue
            f.write(line)


if __name__ == '__main__':
    start = time.time()
    for i in range(1, 501):
        main(i)
    end = time.time()
    print("Elapsed time (sequential crawl):", end - start)

Successfully crawled 5,000 rows of data
image.png

Elapsed time
image.png

2. Multithreaded crawler

Run with 10 threads

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import queue
import time
from threading import Lock, Thread

import requests

# Page numbers waiting to be crawled, shared by all worker threads
page_queue = queue.Queue()
# Serializes file writes so lines from different threads do not interleave
write_lock = Lock()


def main(n):
    # n is the page number to request
    params = {
        "p": str(n),
        "s": "10",
        'f': '{"regCap":[{"start":5000,"end":0}],"regCapType":["1"],"openStatus":["开业"]}',
        "o": "0",
    }
    headers = {
        "Accept": "application/json, text/plain, */*",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": 'zh-CN,zh;q=0.9',
        "Connection": "keep-alive",
        "Cookie": '<captured cookie>',
        'Host': 'aiqicha.baidu.com',
        'Referer': 'https://aiqicha.baidu.com/advancesearch/list',
        'Sec-Ch-Ua': '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
        'Sec-Ch-Ua-Mobile': '?0',
        'Sec-Ch-Ua-Platform': '"Windows"',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Ymg_ssr': '1686218663386_1686228429829_WB0FB0MHEc9Cck35D45asRQbQY/PoeVjFk/N0r6KlgJOBaP2AQeCctHp1+UF5PJ0r0oQ5oILGzmfycrw5JXeQkWKFeoFANP7yZEm/24Zn7bQjBOkalyujWkvRU8gYBnRfui6CEnMI6owfohQkqi47feLPDyQ304QODUtg1jr3lP0Yn+4K10FKN3G210hXv9FpwJz3ze/f6japRpezUodE+/Ac2i8kX4il0MtJjc6SRj1Smi+H0bM1xAH5/LthB1gHm0akZCD0pNskPl3oBpmNvRLcQEqf8D7heK+Krw+A1lkjfywECbMAzcktjLm9XLQHaSG+8O2W5p+F7LF0qTVHDcxw7nEkE8/Ix0zG5NnnPttJF1pvM4H4aOCoXnuARy8MuWogjpUrxI2JqjPd3Fjoz13c4usSIy1rQ5OO8BAe5syq+XIuX6+X2i9hmv4C7NwfkFPzUA0UF483F8KAgBFyVA+Pa2V8o95TjRbeCrwgLmp1q4h2tVtBtoplvnr8WKO3MG16+pHq7vcVaZzYSkxL0yIV9e0SKzgeCxmeIaFNZx2oCCtb4xND6+MRKe2VYOePXFuKYA6mG1MQ7/ZkvXz2IsG8t0dXDdZtmU0M8szb7HGxDYzXCjBXkbvmWuidBoI0xLbFSt8fBeQs2NWT0BUHwP0mRJ+52oDDTBobqYTJdZQ81bBVytZpVqeNU72/0rMnWhf2nbu0zBb4fuu5TdnClbECDrfzkzP5WQ94E5XeJSfEKw1HrmxjOKQhECNflwn8WhnkP32FDquj8e+0yLBADAVT5/dPyyeakElNGd4ZdTI10tszotkziWMyKg+qm2ST/NOpM2apFTWLxtaDePALLbwucfQ0E/aMdhYhztSbd7b28zaL0DYQEAcBwUuA4sGC3I/w63nTR00hi2n8awwTKNtpyJvvMjA1NmuwKWZvBrRLjrVwsYyFjTWVDX2cQa7u3/WHubLJo4uSuqNE3a1+FO3aGFwMZupfH7pCKA3LSRjgO829MQnzX5teielCpcywP933QFbMHbeqkn+zGXlDQ==',
        'Zx-Open-Url': 'https://aiqicha.baidu.com/advancesearch/list'
    }
    url = "https://aiqicha.baidu.com/s/advanceSearchAjax"
    try:
        response = requests.get(url, headers=headers, params=params, timeout=10)
        result = response.json()["data"]["resultList"]
        lines = [
            f"Company name: {item['entName']} "
            f"Registered capital: {item['regCap']} "
            f"Company type: {item['entType']} "
            f"Established: {item['validityFrom']} "
            f"Address: {item['domicile']} "
            f"Status: {item['openStatus']}\n"
            for item in result
        ]
    except (requests.RequestException, ValueError, KeyError, TypeError):
        # Report and skip pages that fail or lack the expected JSON
        print(f"Failed to crawl page {n}")
        return
    with write_lock, open("./a.txt", "a", encoding="utf-8") as f:
        f.writelines(lines)


def run_reptile():
    while True:
        try:
            # get_nowait avoids blocking forever if another thread took the last page
            i = page_queue.get_nowait()
        except queue.Empty:
            break
        main(i)


if __name__ == '__main__':
    for i in range(1, 501):
        page_queue.put(i)
    thread_number = 10
    start = time.time()
    threads = []

    for i in range(thread_number):
        t = Thread(target=run_reptile)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

    end = time.time()
    print("Elapsed time (multithreaded crawl):", end - start)

image.png
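The standard library's `concurrent.futures.ThreadPoolExecutor` offers a more compact way to run the same kind of worker pool than managing threads and a queue by hand. The sketch below is an alternative, not the method used above; `crawl_concurrently` is a hypothetical helper and `fetch_page` stands for any function that returns the list of items for one page.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_concurrently(fetch_page, pages, max_workers=10):
    """Fetch the given pages on a thread pool and return all items in page order."""
    items = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order even though calls run concurrently
        for page_items in pool.map(fetch_page, pages):
            items.extend(page_items)
    return items


if __name__ == "__main__":
    # Stand-in for the real request function, so the sketch runs offline
    def fake_fetch(page):
        return [f"company-{page}-{i}" for i in range(2)]

    print(crawl_concurrently(fake_fetch, range(1, 4)))
```

Because the results are collected in the main thread, the file can be written once at the end without a lock. An exception raised inside `fetch_page` is re-raised when its result is read from `map()`, so the fetch function should handle request errors itself if the crawl is meant to continue.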

5. References

Python crawler: crawling Aiqicha company information (python爬虫 爬取爱企查公司信息), blog of 代码永不报错, CSDN