
New videos

pull/108/head
NaiboWang-Alienware 1 year ago
parent
commit
e54e25172a
17 changed files with 1622 additions and 184 deletions
  1. .temp_to_pub/EasySpider_Linux_x64/软件使用说明.txt (+1, -1)
  2. .temp_to_pub/EasySpider_MacOS_all_arch/软件使用说明.txt (+1, -1)
  3. .temp_to_pub/EasySpider_windows_x64/软件使用说明.txt (+1, -1)
  4. .temp_to_pub/EasySpider_windows_x86/软件使用说明.txt (+1, -1)
  5. ElectronJS/tasks/115.json (+1089, -145)
  6. ElectronJS/tasks/116.json (+1, -0)
  7. ElectronJS/tasks/117.json (+1, -0)
  8. ElectronJS/tasks/118.json (+379, -0)
  9. ElectronJS/tasks/119.json (+1, -0)
  10. ElectronJS/tasks/120.json (+1, -0)
  11. ElectronJS/tasks/121.json (+1, -0)
  12. ElectronJS/tasks/122.json (+1, -0)
  13. ElectronJS/tasks/123.json (+1, -0)
  14. ExecuteStage/.vscode/launch.json (+1, -1)
  15. ExecuteStage/easyspider_executestage.py (+64, -33)
  16. ExecuteStage/myChrome.py (+75, -0)
  17. Readme.md (+3, -1)

+ 1
- 1
.temp_to_pub/EasySpider_Linux_x64/软件使用说明.txt

@ -8,7 +8,7 @@
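Supports Ubuntu 20.04, Debian, and Deepin (x64) and later versions.
-Video tutorial: https://www.bilibili.com/video/BV1Fk4y1L7xX/
+Video tutorial: https://www.bilibili.com/video/BV1th411A7ey/
Tasks can be imported from other machines: simply place the .json files from the other machine's tasks folder into the tasks folder in this directory. Likewise, execution-ID files can be imported by copying the .json files from the execution_instances folder. Note that the .json files in both folders must be named with a number greater than 0.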
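
The import rule in the note above (copy `.json` files whose names are numbers greater than 0) can be sketched as a small helper. This is only an illustration: the function name `import_json_files` is hypothetical, the rejection of names with leading zeros such as `007.json` is an assumption, and an existing file with the same name in the destination is overwritten.

```python
import re
import shutil
from pathlib import Path

# Assumption: a valid name is a positive integer without leading zeros, e.g. "118.json".
VALID_NAME = re.compile(r"^[1-9][0-9]*\.json$")


def import_json_files(src_dir, dst_dir):
    """Copy validly named .json files from src_dir into dst_dir.

    Returns the list of file names that were copied.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob("*.json")):
        if VALID_NAME.match(path.name):
            # copy2 overwrites an existing file with the same name
            shutil.copy2(path, dst / path.name)
            copied.append(path.name)
    return copied
```

The same helper would apply to both the `tasks` and `execution_instances` folders, since the note states the naming rule for both.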

+ 1
- 1
.temp_to_pub/EasySpider_MacOS_all_arch/软件使用说明.txt

@ -6,7 +6,7 @@
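For macOS 10.x, please download and use v0.2.0.
-Video tutorial: https://www.bilibili.com/video/BV1Fk4y1L7xX/
+Video tutorial: https://www.bilibili.com/video/BV1th411A7ey/
Tasks can be imported from other machines: simply place the .json files from the other machine's tasks folder into the /Users/your-username/Library/Application Support/EasySpider/tasks folder. Likewise, execution-ID files can be imported by copying the .json files from the execution_instances folder. Note that the .json files in both folders must be named with a number greater than 0.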

+ 1
- 1
.temp_to_pub/EasySpider_windows_x64/软件使用说明.txt

@ -6,7 +6,7 @@
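This release has no build that runs directly on Windows 7 (because Chrome 109 is the last Chrome version to support Windows 7), but the 32-bit build of v0.2.0 works, and the software can also be run by compiling it yourself. To collect data on Windows 7, download the 32-bit build of v0.2.0 or download the source code and compile it yourself: https://github.com/NaiboWang/EasySpider/releases/tag/v0.2.0
-Video tutorial: https://www.bilibili.com/video/BV1Fk4y1L7xX/
+Video tutorial: https://www.bilibili.com/video/BV1th411A7ey/
This software is absolutely not a Trojan or a virus! If antivirus software such as Windows Defender mistakenly flags it as a virus, restore it, or open "EasySpider.bat" to run our software.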

+ 1
- 1
.temp_to_pub/EasySpider_windows_x86/软件使用说明.txt

@ -6,7 +6,7 @@
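This release has no build that runs directly on Windows 7 (because Chrome 109 is the last Chrome version to support Windows 7), but the 32-bit build of v0.2.0 works, and the software can also be run by compiling it yourself. To collect data on Windows 7, download the 32-bit build of v0.2.0 or download the source code and compile it yourself: https://github.com/NaiboWang/EasySpider/releases/tag/v0.2.0
-Video tutorial: https://www.bilibili.com/video/BV1Fk4y1L7xX/
+Video tutorial: https://www.bilibili.com/video/BV1th411A7ey/
This software is absolutely not a Trojan or a virus! If antivirus software such as Windows Defender mistakenly flags it as a virus, restore it, or open "EasySpider.bat" to run our software.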

+ 1089
- 145
ElectronJS/tasks/115.json
File diff suppressed because it is too large


+ 1
- 0
ElectronJS/tasks/116.json

@ -0,0 +1 @@
{"id":116,"name":"iP地址查询--手机号码查询归属地 | 邮政编码查询 | iP地址归属地查询 | 身份证号码验证在线查询网","url":"https://www.ip138.com","links":"https://www.ip138.com","create_time":"7/4/2023, 8:21:10 AM","version":"0.3.5","saveThreshold":10,"cloudflare":0,"environment":0,"containJudge":false,"desc":"https://www.ip138.com","inputParameters":[{"id":0,"name":"urlList_0","nodeId":1,"nodeName":"打开网页","value":"https://www.ip138.com","desc":"要采集的网址列表,多行以\\n分开","type":"string","exampleValue":"https://www.ip138.com"}],"outputParameters":[{"id":0,"name":"参数1_文本","desc":"","type":"string","exampleValue":"502BadGateway"},{"id":1,"name":"参数2_文本","desc":"","type":"string","exampleValue":"nginx"}],"graph":[{"index":0,"id":0,"parentId":0,"type":-1,"option":0,"title":"root","sequence":[1,2,3],"parameters":{"history":1,"tabIndex":0,"useLoop":false,"xpath":"","wait":0},"isInLoop":false},{"id":1,"index":1,"parentId":0,"type":0,"option":1,"title":"打开网页","sequence":[],"isInLoop":false,"position":0,"parameters":{"useLoop":false,"xpath":"","wait":0,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"url":"https://www.ip138.com","links":"https://www.ip138.com","maxWaitTime":10,"scrollType":0,"scrollCount":1,"scrollWaitTime":1}},{"id":2,"index":2,"parentId":0,"type":0,"option":2,"title":"点击元素","sequence":[],"isInLoop":false,"position":1,"parameters":{"history":4,"tabIndex":-1,"useLoop":false,"xpath":"/html/body/p[1]/a[1]","iframe":true,"wait":2,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"scrollType":0,"scrollCount":1,"scrollWaitTime":1,"clickWay":0,"maxWaitTime":10,"paras":[],"allXPaths":["/html/body/p[1]/a[1]","//a[contains(., 
'220.255.29')]","/html/body/p[last()-2]/a[last()-1]"]}},{"id":3,"index":3,"parentId":0,"type":1,"option":8,"title":"循环","sequence":[4],"isInLoop":false,"position":2,"parameters":{"history":3,"tabIndex":-1,"useLoop":false,"xpath":"/html/body/center","iframe":true,"wait":0,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"scrollType":0,"scrollCount":1,"scrollWaitTime":1,"loopType":1,"pathList":"","textList":"","code":"","waitTime":0,"exitCount":0,"historyWait":2,"breakMode":0,"breakCode":"","breakCodeWaitTime":0,"allXPaths":["/html/body/center[1]","//center[contains(., '502 Bad Ga')]","/html/body/center[last()-1]"]}},{"id":4,"index":4,"parentId":3,"type":0,"option":3,"title":"提取数据","sequence":[],"isInLoop":true,"position":0,"parameters":{"history":3,"tabIndex":-1,"useLoop":false,"xpath":"","iframe":true,"wait":0,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"paras":[{"nodeType":0,"contentType":1,"relative":true,"name":"参数1_文本","desc":"","relativeXPath":"/h1[1]","allXPaths":["/h1[1]","//h1[contains(., '502 Bad Ga')]","/html/body/center[last()-1]/h1"],"exampleValues":[{"num":0,"value":"502BadGateway"}],"unique_index":"/h1[1]","iframe":true,"default":"","beforeJS":"","beforeJSWaitTime":0,"JS":"","JSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"downloadPic":0},{"nodeType":0,"contentType":1,"relative":true,"name":"参数2_文本","desc":"","relativeXPath":"","allXPaths":["","//center[contains(., 'nginx')]","/html/body/center"],"exampleValues":[{"num":1,"value":"nginx"}],"unique_index":"","iframe":true,"default":"","beforeJS":"","beforeJSWaitTime":0,"JS":"","JSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"downloadPic":0}],"loopType":1}}]}

+ 1
- 0
ElectronJS/tasks/117.json
File diff suppressed because it is too large


+ 379
- 0
ElectronJS/tasks/118.json

@ -0,0 +1,379 @@
{
"id": 118,
"name": "iP地址查询--手机号码查询归属地 | 邮政编码查询 | iP地址归属地查询 | 身份证号码验证在线查询网",
"url": "https://www.ip138.com",
"links": "https://www.ip138.com",
"create_time": "7/4/2023, 8:43:31 AM",
"version": "0.3.5",
"saveThreshold": 10,
"cloudflare": 0,
"environment": 0,
"containJudge": false,
"desc": "https://www.ip138.com",
"inputParameters": [
{
"id": 0,
"name": "urlList_0",
"nodeId": 1,
"nodeName": "打开网页",
"value": "https://www.ip138.com",
"desc": "要采集的网址列表,多行以\\n分开",
"type": "string",
"exampleValue": "https://www.ip138.com"
}
],
"outputParameters": [
{
"id": 0,
"name": "参数1_文本",
"desc": "",
"type": "string",
"exampleValue": "502 Bad Gateway"
}
],
"graph": [
{
"index": 0,
"id": 0,
"parentId": 0,
"type": -1,
"option": 0,
"title": "root",
"sequence": [1, 4],
"parameters": {
"history": 1,
"tabIndex": 0,
"useLoop": false,
"xpath": "",
"wait": 0
},
"isInLoop": false
},
{
"id": 1,
"index": 1,
"parentId": 0,
"type": 0,
"option": 1,
"title": "打开网页",
"sequence": [],
"isInLoop": false,
"position": 0,
"parameters": {
"useLoop": false,
"xpath": "",
"wait": 0,
"waitType": 0,
"beforeJS": "",
"beforeJSWaitTime": 0,
"afterJS": "",
"afterJSWaitTime": 0,
"url": "https://www.ip138.com",
"links": "https://www.ip138.com",
"maxWaitTime": 10,
"scrollType": 0,
"scrollCount": 1,
"scrollWaitTime": 1
}
},
{
"id": -1,
"index": 2,
"parentId": 0,
"type": 1,
"option": 8,
"title": "循环",
"sequence": [3],
"isInLoop": false,
"position": 1,
"parameters": {
"history": 4,
"tabIndex": -1,
"useLoop": false,
"xpath": "/html/body/p",
"iframe": true,
"wait": 0,
"waitType": 0,
"beforeJS": "",
"beforeJSWaitTime": 0,
"afterJS": "",
"afterJSWaitTime": 0,
"scrollType": 0,
"scrollCount": 1,
"scrollWaitTime": 1,
"loopType": 1,
"pathList": "",
"textList": "",
"code": "",
"waitTime": 0,
"exitCount": 0,
"historyWait": 2,
"breakMode": 0,
"breakCode": "",
"breakCodeWaitTime": 0,
"allXPaths": [
"/html/body/p[1]",
"//p[contains(., '您的iP地址是:[')]",
"/html/body/p[last()-2]"
]
}
},
{
"id": -1,
"index": 3,
"parentId": 2,
"type": 0,
"option": 3,
"title": "提取数据",
"sequence": [],
"isInLoop": true,
"position": 0,
"parameters": {
"history": 4,
"tabIndex": -1,
"useLoop": false,
"xpath": "",
"iframe": true,
"wait": 0,
"waitType": 0,
"beforeJS": "",
"beforeJSWaitTime": 0,
"afterJS": "",
"afterJSWaitTime": 0,
"paras": [
{
"nodeType": 0,
"contentType": 1,
"relative": true,
"name": "参数1_文本",
"desc": "",
"relativeXPath": "",
"allXPaths": [
"",
"//p[contains(., '您的iP地址是:[')]",
"/html/body/p[last()-2]"
],
"exampleValues": [
{ "num": 0, "value": "您的iP地址是:[]来自:新加坡Singtel" }
],
"unique_index": "",
"iframe": true,
"default": "",
"beforeJS": "",
"beforeJSWaitTime": 0,
"JS": "",
"JSWaitTime": 0,
"afterJS": "",
"afterJSWaitTime": 0,
"downloadPic": 0
},
{
"nodeType": 1,
"contentType": 0,
"relative": true,
"name": "参数2_链接文本",
"desc": "",
"relativeXPath": "/a[1]",
"allXPaths": [
"/a[1]",
"//a[contains(., '220.255.29')]",
"/html/body/p[last()-2]/a[last()-1]"
],
"exampleValues": [{ "num": 0, "value": "220.255.29.208" }],
"unique_index": "/a[1]",
"iframe": true,
"default": "",
"beforeJS": "",
"beforeJSWaitTime": 0,
"JS": "",
"JSWaitTime": 0,
"afterJS": "",
"afterJSWaitTime": 0,
"downloadPic": 0
},
{
"nodeType": 2,
"contentType": 0,
"relative": true,
"name": "参数3_链接地址",
"desc": "",
"relativeXPath": "/a[1]",
"allXPaths": [
"/a[1]",
"//a[contains(., '220.255.29')]",
"/html/body/p[last()-2]/a[last()-1]"
],
"exampleValues": [
{
"num": 0,
"value": "https://www.ip138.com/iplookup.php?ip=220.255.29.208&action=2"
}
],
"unique_index": "/a[1]",
"iframe": true,
"default": "",
"beforeJS": "",
"beforeJSWaitTime": 0,
"JS": "",
"JSWaitTime": 0,
"afterJS": "",
"afterJSWaitTime": 0,
"downloadPic": 0
},
{
"nodeType": 1,
"contentType": 0,
"relative": true,
"name": "参数4_链接文本",
"desc": "",
"relativeXPath": "/a[2]",
"allXPaths": [
"/a[2]",
"//a[contains(., '')]",
"/html/body/p[last()-2]/a"
],
"exampleValues": [{ "num": 0, "value": "" }],
"unique_index": "/a[2]",
"iframe": true,
"default": "",
"beforeJS": "",
"beforeJSWaitTime": 0,
"JS": "",
"JSWaitTime": 0,
"afterJS": "",
"afterJSWaitTime": 0,
"downloadPic": 0
},
{
"nodeType": 2,
"contentType": 0,
"relative": true,
"name": "参数5_链接地址",
"desc": "",
"relativeXPath": "/a[2]",
"allXPaths": [
"/a[2]",
"//a[contains(., '')]",
"/html/body/p[last()-2]/a"
],
"exampleValues": [
{ "num": 0, "value": "https://www.ipshudi.com/" }
],
"unique_index": "/a[2]",
"iframe": true,
"default": "",
"beforeJS": "",
"beforeJSWaitTime": 0,
"JS": "",
"JSWaitTime": 0,
"afterJS": "",
"afterJSWaitTime": 0,
"downloadPic": 0
},
{
"nodeType": 4,
"contentType": 1,
"relative": true,
"name": "参数6_图片地址",
"desc": "",
"relativeXPath": "/a[2]/img[1]",
"allXPaths": [
"/a[2]/img[1]",
"//img[contains(., '')]",
"/html/body/p[last()-2]/a/img"
],
"exampleValues": [
{ "num": 0, "value": "https://6.ipchaxun.net/220.255.29.208.gif" }
],
"unique_index": "/a[2]/img[1]",
"iframe": true,
"default": "",
"beforeJS": "",
"beforeJSWaitTime": 0,
"JS": "",
"JSWaitTime": 0,
"afterJS": "",
"afterJSWaitTime": 0,
"downloadPic": 0
},
{
"nodeType": 0,
"contentType": 1,
"relative": true,
"name": "参数7_文本",
"desc": "",
"relativeXPath": "/a[1]/font[1]",
"allXPaths": [
"/a[1]/font[1]",
"//font[contains(., 'ip查询api接口')]",
"/html/body/p[last()-1]/a/font"
],
"exampleValues": [{ "num": 1, "value": "ip查询api接口" }],
"unique_index": "/a[1]/font[1]",
"iframe": true,
"default": "",
"beforeJS": "",
"beforeJSWaitTime": 0,
"JS": "",
"JSWaitTime": 0,
"afterJS": "",
"afterJSWaitTime": 0,
"downloadPic": 0
}
],
"loopType": 1
}
},
{
"id": 2,
"index": 4,
"parentId": 0,
"type": 0,
"option": 3,
"title": "提取数据",
"sequence": [],
"isInLoop": false,
"position": 1,
"parameters": {
"history": 3,
"tabIndex": -1,
"useLoop": false,
"xpath": "",
"iframe": true,
"wait": 0,
"waitType": 0,
"beforeJS": "",
"beforeJSWaitTime": 0,
"afterJS": "",
"afterJSWaitTime": 0,
"paras": [
{
"nodeType": 0,
"contentType": 0,
"relative": false,
"name": "参数1_文本",
"desc": "",
"extractType": 0,
"relativeXPath": "/html/body/center[1]/h1[1]",
"allXPaths": [
"/html/body/center[1]/h1[1]",
"//h1[contains(., '502 Bad Ga')]",
"/html/body/center[last()-1]/h1"
],
"exampleValues": [{ "num": 0, "value": "502 Bad Gateway" }],
"unique_index": "l37gwwpsg29ljnkgn7r",
"iframe": true,
"default": "",
"beforeJS": "",
"beforeJSWaitTime": 0,
"JS": "",
"JSWaitTime": 0,
"afterJS": "",
"afterJSWaitTime": 0,
"downloadPic": 0
}
]
}
}
]
}
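
To read a task file like 118.json, it helps to follow the `sequence` arrays from the root node. The sketch below assumes that each entry of `sequence` is a position in the `graph` list (which matches the `index` fields in this file) and that execution starts at position 0; the helper name `walk_graph` is hypothetical. In 118.json the root's `sequence` is `[1, 4]`, so the two nodes with `"id": -1` are never reached from the root.

```python
def walk_graph(graph, index=0, depth=0):
    """Yield (depth, index, title) for every node reachable from graph[index].

    Assumption: entries of "sequence" are positions in the graph list.
    """
    node = graph[index]
    yield depth, node["index"], node["title"]
    for child in node["sequence"]:
        yield from walk_graph(graph, child, depth + 1)


# Trimmed-down copy of the graph in 118.json: only the fields used above.
graph = [
    {"index": 0, "title": "root", "sequence": [1, 4]},
    {"index": 1, "title": "打开网页", "sequence": []},
    {"index": 2, "title": "循环", "sequence": [3]},
    {"index": 3, "title": "提取数据", "sequence": []},
    {"index": 4, "title": "提取数据", "sequence": []},
]

for depth, index, title in walk_graph(graph):
    print("  " * depth + f"[{index}] {title}")
```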

+ 1
- 0
ElectronJS/tasks/119.json

@ -0,0 +1 @@
{"id":119,"name":"","url":"https://lihkg.com/","links":"https://lihkg.com/thread/3433502/page/1","create_time":"7/4/2023, 5:09:13 PM","version":"0.3.5","saveThreshold":10,"cloudflare":0,"environment":0,"containJudge":false,"desc":"https://lihkg.com/","inputParameters":[{"id":0,"name":"urlList_0","nodeId":1,"nodeName":"打开网页","value":"https://lihkg.com/thread/3433502/page/1","desc":"要采集的网址列表,多行以\\n分开","type":"string","exampleValue":"https://lihkg.com/thread/3433502/page/1"}],"outputParameters":[],"graph":[{"index":0,"id":0,"parentId":0,"type":-1,"option":0,"title":"root","sequence":[1,6],"parameters":{"history":1,"tabIndex":0,"useLoop":false,"xpath":"","wait":0},"isInLoop":false},{"id":1,"index":1,"parentId":0,"type":0,"option":1,"title":"打开网页","sequence":[],"isInLoop":false,"position":0,"parameters":{"useLoop":false,"xpath":"","wait":0,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"url":"https://lihkg.com/","links":"https://lihkg.com/thread/3433502/page/1","maxWaitTime":10,"scrollType":"2","scrollCount":3,"scrollWaitTime":1}},{"id":-1,"index":2,"parentId":0,"type":0,"option":2,"title":"点击元素","sequence":[],"isInLoop":false,"position":1,"parameters":{"history":3,"tabIndex":-1,"useLoop":false,"xpath":"//*[contains(@class, \"P3e8vKaXmUeXC9dJgjnsu\")]/div[1]","iframe":false,"wait":2,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"scrollType":0,"scrollCount":1,"scrollWaitTime":1,"clickWay":0,"maxWaitTime":10,"paras":[],"allXPaths":["/html/body/div[1]/div[2]/div[2]/div[1]/div[1]","//div[contains(., 'LIHKG 討論區使')]","/html/body/div[last()-4]/div[last()-2]/div/div/div"]}},{"id":-1,"index":3,"parentId":0,"type":0,"option":2,"title":"点击元素","sequence":[],"isInLoop":false,"position":1,"parameters":{"history":3,"tabIndex":-1,"useLoop":false,"xpath":"//*[contains(@class, 
\"P3e8vKaXmUeXC9dJgjnsu\")]","iframe":false,"wait":2,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"scrollType":0,"scrollCount":1,"scrollWaitTime":1,"clickWay":0,"maxWaitTime":10,"paras":[],"allXPaths":["/html/body/div[1]/div[2]/div[2]/div[1]","//div[contains(., 'LIHKG 討論區使')]","//DIV[@class='P3e8vKaXmUeXC9dJgjnsu']","/html/body/div[last()-4]/div[last()-2]/div/div"]}},{"id":-1,"index":4,"parentId":0,"type":0,"option":2,"title":"点击元素","sequence":[],"isInLoop":false,"position":1,"parameters":{"history":3,"tabIndex":-1,"useLoop":false,"xpath":"//*[contains(@class, \"_1PdImYJBCsN8lH0MB4tnqV\")]/a[1]","iframe":false,"wait":2,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"scrollType":"2","scrollCount":5,"scrollWaitTime":1,"clickWay":0,"maxWaitTime":10,"paras":[],"allXPaths":["/html/body/div[1]/div[2]/div[1]/div[1]/ul[1]/li[1]/a[1]","//a[contains(., '最新')]","/html/body/div[last()-4]/div[last()-2]/div[last()-1]/div[last()-1]/ul/li[last()-1]/a"]}},{"id":-1,"index":5,"parentId":0,"type":0,"option":2,"title":"点击元素","sequence":[],"isInLoop":false,"position":1,"parameters":{"history":3,"tabIndex":-1,"useLoop":false,"xpath":"//*[contains(@class, \"P3e8vKaXmUeXC9dJgjnsu\")]","iframe":false,"wait":2,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"scrollType":0,"scrollCount":1,"scrollWaitTime":1,"clickWay":0,"maxWaitTime":10,"paras":[],"allXPaths":["/html/body/div[1]/div[2]/div[2]/div[1]","//div[contains(., 'LIHKG 
討論區使')]","//DIV[@class='P3e8vKaXmUeXC9dJgjnsu']","/html/body/div[last()-4]/div[last()-2]/div/div"]}},{"id":2,"index":6,"parentId":0,"type":0,"option":2,"title":"点击元素","sequence":[],"isInLoop":false,"position":1,"parameters":{"history":5,"tabIndex":-1,"useLoop":false,"xpath":"//*[@id=\"1\"]/div[1]/small[1]","iframe":false,"wait":2,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"scrollType":"2","scrollCount":4,"scrollWaitTime":1,"clickWay":0,"maxWaitTime":10,"paras":[],"allXPaths":["/html/body/div[1]/div[2]/div[2]/div[1]/div[2]/div[2]/div[2]/div[1]/div[1]/small[1]","//small[contains(., '#1李芯悅•1 小時')]","//SMALL[@class='_1VcuFUmnOEK51TsshmrnJM']","/html/body/div[last()-5]/div[last()-2]/div/div/div[last()-2]/div/div[last()-24]/div[last()-1]/div/small"]}}]}

+ 1
- 0
ElectronJS/tasks/120.json

@ -0,0 +1 @@
{"id":120,"name":"京东全球版-专业的综合网上购物商城","url":"https://www.jd.com","links":"https://www.jd.com","create_time":"7/4/2023, 5:36:05 PM","version":"0.3.5","saveThreshold":10,"cloudflare":0,"environment":0,"containJudge":false,"desc":"https://www.jd.com","inputParameters":[{"id":0,"name":"urlList_0","nodeId":1,"nodeName":"打开网页","value":"https://www.jd.com","desc":"要采集的网址列表,多行以\\n分开","type":"string","exampleValue":"https://www.jd.com"},{"id":1,"name":"inputText_1","nodeName":"输入文字","nodeId":3,"desc":"要输入的文本,如京东搜索框输入:电脑","type":"string","exampleValue":"123Field[\"自定义操作\"]456","value":"123Field[\"自定义操作\"]456"}],"outputParameters":[{"id":0,"name":"自定义操作","desc":"自定义操作返回的数据","type":"string","exampleValue":""}],"graph":[{"index":0,"id":0,"parentId":0,"type":-1,"option":0,"title":"root","sequence":[1,2,3],"parameters":{"history":1,"tabIndex":0,"useLoop":false,"xpath":"","wait":0},"isInLoop":false},{"id":1,"index":1,"parentId":0,"type":0,"option":1,"title":"打开网页","sequence":[],"isInLoop":false,"position":0,"parameters":{"useLoop":false,"xpath":"","wait":0,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"url":"https://www.jd.com","links":"https://www.jd.com","maxWaitTime":10,"scrollType":0,"scrollCount":1,"scrollWaitTime":1}},{"id":2,"index":2,"parentId":0,"type":0,"option":5,"title":"自定义操作","sequence":[],"isInLoop":false,"position":1,"parameters":{"history":1,"tabIndex":0,"useLoop":false,"xpath":"","iframe":false,"wait":0,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"codeMode":"1","code":"python 
D:/test.py","waitTime":0,"recordASField":"1"}},{"id":3,"index":3,"parentId":0,"type":0,"option":4,"title":"输入文字","sequence":[],"isInLoop":false,"position":2,"parameters":{"history":4,"tabIndex":-1,"useLoop":false,"xpath":"//*[@id=\"key\"]","iframe":false,"wait":0,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"value":"123Field[\"自定义操作\"]456","allXPaths":["/html/body/div[4]/div[1]/div[2]/div[1]/input[1]","//input[contains(., '')]","id(\"key\")","//INPUT[@class='text defcolor']","/html/body/div[last()-6]/div/div[last()-2]/div/input"]}}]}

+ 1
- 0
ElectronJS/tasks/121.json

@ -0,0 +1 @@
{"id":121,"name":"京东全球版-专业的综合网上购物商城","url":"https://www.jd.com","links":"https://www.jd.com","create_time":"7/4/2023, 5:55:11 PM","version":"0.3.5","saveThreshold":10,"cloudflare":0,"environment":0,"containJudge":false,"desc":"https://www.jd.com","inputParameters":[{"id":0,"name":"urlList_0","nodeId":1,"nodeName":"打开网页","value":"https://www.jd.com","desc":"要采集的网址列表,多行以\\n分开","type":"string","exampleValue":"https://www.jd.com"}],"outputParameters":[],"graph":[{"index":0,"id":0,"parentId":0,"type":-1,"option":0,"title":"root","sequence":[1],"parameters":{"history":1,"tabIndex":0,"useLoop":false,"xpath":"","wait":0},"isInLoop":false},{"id":1,"index":1,"parentId":0,"type":0,"option":1,"title":"打开网页","sequence":[],"isInLoop":false,"position":0,"parameters":{"useLoop":false,"xpath":"","wait":15,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"url":"https://www.jd.com","links":"https://www.jd.com","maxWaitTime":10,"scrollType":"2","scrollCount":3,"scrollWaitTime":1}},{"id":-1,"index":2,"parentId":0,"type":1,"option":8,"title":"循环","sequence":[],"isInLoop":false,"position":1,"parameters":{"history":1,"tabIndex":0,"useLoop":false,"xpath":"","iframe":false,"wait":0,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"scrollType":0,"scrollCount":1,"scrollWaitTime":1,"loopType":0,"pathList":"","textList":"","code":"","waitTime":0,"exitCount":0,"historyWait":2,"breakMode":0,"breakCode":"","breakCodeWaitTime":0}},{"id":-1,"index":3,"parentId":0,"type":0,"option":2,"title":"点击元素","sequence":[],"isInLoop":false,"position":1,"parameters":{"history":1,"tabIndex":0,"useLoop":false,"xpath":"","iframe":false,"wait":2,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"scrollType":0,"scrollCount":1,"scrollWaitTime":1,"clickWay":0,"maxWaitTime":10,"paras":[]}}]}

+ 1
- 0
ElectronJS/tasks/122.json

@ -0,0 +1 @@
{"id":122,"name":"bbs","url":"https://lihkg.com/thread/3429557/page/1","links":"https://lihkg.com/thread/3429557/page/1","create_time":"7/4/2023, 5:57:56 PM","version":"0.3.5","saveThreshold":10,"cloudflare":0,"environment":0,"containJudge":false,"desc":"https://lihkg.com/thread/3429557/page/1","inputParameters":[{"id":0,"name":"urlList_0","nodeId":1,"nodeName":"打开网页","value":"https://lihkg.com/thread/3429557/page/1","desc":"要采集的网址列表,多行以\\n分开","type":"string","exampleValue":"https://lihkg.com/thread/3429557/page/1"}],"outputParameters":[],"graph":[{"index":0,"id":0,"parentId":0,"type":-1,"option":0,"title":"root","sequence":[1,2],"parameters":{"history":1,"tabIndex":0,"useLoop":false,"xpath":"","wait":0},"isInLoop":false},{"id":1,"index":1,"parentId":0,"type":0,"option":1,"title":"打开网页","sequence":[],"isInLoop":false,"position":0,"parameters":{"useLoop":false,"xpath":"","wait":0,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"url":"https://lihkg.com/thread/3429557/page/1","links":"https://lihkg.com/thread/3429557/page/1","maxWaitTime":10,"scrollType":0,"scrollCount":1,"scrollWaitTime":1}},{"id":2,"index":2,"parentId":0,"type":0,"option":2,"title":"点击元素","sequence":[],"isInLoop":false,"position":1,"parameters":{"history":4,"tabIndex":-1,"useLoop":false,"xpath":"//*[contains(@class, \"_1PdImYJBCsN8lH0MB4tnqV\")]/a[1]","iframe":false,"wait":2,"waitType":0,"beforeJS":"","beforeJSWaitTime":0,"afterJS":"","afterJSWaitTime":0,"scrollType":"2","scrollCount":3,"scrollWaitTime":1,"clickWay":0,"maxWaitTime":10,"paras":[],"allXPaths":["/html/body/div[1]/div[2]/div[1]/div[1]/ul[1]/li[1]/a[1]","//a[contains(., '最新')]","/html/body/div[last()-5]/div[last()-2]/div[last()-1]/div[last()-1]/ul/li[last()-1]/a"]}}]}

+ 1
- 0
ElectronJS/tasks/123.json
File diff suppressed because it is too large


+ 1
- 1
ExecuteStage/.vscode/launch.json

@ -10,7 +10,7 @@
"program": "${file}",
"console": "integratedTerminal",
"justMyCode": true,
-"args": ["--id", "[5]", "--read_type", "remote", "--headless", "0"]
+"args": ["--id", "[18]", "--read_type", "remote", "--headless", "0"]
// "args": ["--id", "[2]", "--read_type", "remote", "--headless", "0", "--saved_file_name", "YOUTUBE"]
// "args": ["--id", "[44]", "--headless", "0", "--user_data", "1"]
}

+ 64
- 33
ExecuteStage/easyspider_executestage.py

@ -12,7 +12,7 @@ import sys
# import base64
# import hashlib
import time
-import keyboard
+# import keyboard
import requests
from lxml import etree
from selenium.webdriver.chrome.options import Options
@ -28,18 +28,18 @@ from selenium.common.exceptions import StaleElementReferenceException, InvalidSe
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import Select
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
import undetected_chromedriver as uc
import random
# import numpy
import csv
import os
from selenium.webdriver.common.by import By
from commandline_config import Config
import pytesseract
from PIL import Image
import uuid
# import uuid
from threading import Thread, Event
from myChrome import MyChrome
from utils import check_pause, download_image, get_output_code, isnull
desired_capabilities = DesiredCapabilities.CHROME
desired_capabilities["pageLoadStrategy"] = "none"
@ -111,11 +111,23 @@ class BrowserThread(Thread):
# If there are no complex operations, optimize the data-extraction flow
def preprocess(self):
for node in self.procedure:
+try:
+iframe = node["parameters"]["iframe"]
+except:
+node["parameters"]["iframe"] = False
if node["option"] == 3: # extract-data operation
paras = node["parameters"]["paras"]
for para in paras:
+try:
+iframe = para["iframe"]
+except:
+para["iframe"] = False
if para["beforeJS"] == "" and para["afterJS"] == "" and para["contentType"] <= 1 and para["nodeType"] <= 2:
-para["optimizable"] = True
+# Absolute addressing for data extraction inside an iframe cannot be optimized
+if para["relative"] == False and para["iframe"] == True:
+para["optimizable"] = False
+else:
+para["optimizable"] = True
else:
para["optimizable"] = False
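
The rule this hunk adds to `preprocess` can be restated as a standalone function. This is a minimal sketch, not the project's code: the name `mark_optimizable` is hypothetical, `setdefault` stands in for the `try`/`except` default, and it assumes `relative` and `iframe` are booleans.

```python
def mark_optimizable(para):
    """Set para["optimizable"] following the rule in the hunk above."""
    # Older task files have no "iframe" key; default it to False.
    para.setdefault("iframe", False)
    simple = (
        para["beforeJS"] == ""
        and para["afterJS"] == ""
        and para["contentType"] <= 1
        and para["nodeType"] <= 2
    )
    # Absolute XPath extraction inside an iframe is excluded from the fast path.
    absolute_in_iframe = (not para["relative"]) and para["iframe"]
    para["optimizable"] = simple and not absolute_in_iframe
    return para
```

Under this rule, the parameter in node 4 of 118.json (`"relative": false`, `"iframe": true`) would be marked as not optimizable, while the relative parameters inside the loop keep the fast path as long as their `nodeType` is at most 2.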
@ -170,7 +182,8 @@ class BrowserThread(Thread):
if scrollType != 0 and para["scrollCount"] > 0: # scroll the screen down
for i in range(para["scrollCount"]):
self.Log("Wait for set second after screen scrolling")
-body = self.browser.find_element(By.CSS_SELECTOR, "body")
+body = self.browser.find_element(
+By.CSS_SELECTOR, "body", iframe=para["iframe"])
if scrollType == 1:
body.send_keys(Keys.PAGE_DOWN)
elif scrollType == 2:
@ -183,7 +196,8 @@ class BrowserThread(Thread):
if scrollType != 0 and para["scrollCount"] > 0: # scroll the screen down
for i in range(para["scrollCount"]):
self.Log("Wait for set second after screen scrolling")
-body = self.browser.find_element(By.CSS_SELECTOR, "body")
+body = self.browser.find_element(
+By.CSS_SELECTOR, "body", iframe=para["iframe"])
if scrollType == 1:
body.send_keys(Keys.PGDN)
elif scrollType == 2:
@ -253,7 +267,8 @@ class BrowserThread(Thread):
max_wait_time = int(paras["waitTime"])
if codeMode == 2: # when a loop is used, the clickPath passed in is the actual XPath
try:
-elements = self.browser.find_elements(By.XPATH, loopPath)
+elements = self.browser.find_elements(
+By.XPATH, loopPath, iframe=paras["iframe"])
element = elements[index]
output = self.execute_code(
codeMode, code, max_wait_time, element)
@ -277,7 +292,7 @@ class BrowserThread(Thread):
optionValue = para["optionValue"]
try:
dropdown = Select(self.browser.find_element(
-By.XPATH, para["xpath"]))
+By.XPATH, para["xpath"], iframe=para["iframe"]))
try:
if optionMode == 0:
# get the index of the currently selected option
@ -310,7 +325,8 @@ class BrowserThread(Thread):
index = 0
path = para["xpath"] # otherwise use the XPath defined on the element
try:
-elements = self.browser.find_elements(By.XPATH, path)
+elements = self.browser.find_elements(
+By.XPATH, path, iframe=para["iframe"])
element = elements[index]
try:
ActionChains(self.browser).move_to_element(element).perform()
@ -396,7 +412,7 @@ class BrowserThread(Thread):
continue
elif tType == 2: # the current page contains the element
try:
-if self.browser.find_element(By.XPATH, cnode["parameters"]["value"]):
+if self.browser.find_element(By.XPATH, cnode["parameters"]["value"], iframe=cnode["parameters"]["iframe"]):
executeBranchId = i
break
except: # element not found or the XPath is wrong; move on to the next condition
@ -410,7 +426,7 @@ class BrowserThread(Thread):
continue
elif tType == 4: # the current loop element contains the element
try:
-if loopElement.find_element(By.XPATH, cnode["parameters"]["value"][1:]):
+if loopElement.find_element(By.XPATH, cnode["parameters"]["value"][1:], iframe=cnode["parameters"]["iframe"]):
executeBranchId = i
break
except: # element not found or the XPath is wrong; move on to the next condition
@ -449,7 +465,8 @@ class BrowserThread(Thread):
'return history.length') # record the history length for this loop iteration
self.history["index"] = thisHistoryLength
self.history["handle"] = thisHandle
+if node["parameters"]["iframe"]:
+self.browser.switch_to.default_content() # switch to the main document before looping
if int(node["parameters"]["loopType"]) == 0: # single-element loop
# no tab-switching operation
count = 0 # execution count
@ -457,7 +474,7 @@ class BrowserThread(Thread):
try:
finished = False
element = self.browser.find_element(
-By.XPATH, node["parameters"]["xpath"])
+By.XPATH, node["parameters"]["xpath"], iframe=node["parameters"]["iframe"])
for i in node["sequence"]: # run the operations one by one
self.executeNode(
i, element, node["parameters"]["xpath"], 0)
@ -504,7 +521,7 @@ class BrowserThread(Thread):
elif int(node["parameters"]["loopType"]) == 1: # variable list of elements
try:
elements = self.browser.find_elements(By.XPATH,
-node["parameters"]["xpath"])
+node["parameters"]["xpath"], iframe=node["parameters"]["iframe"])
if len(elements) == 0:
print("Loop element not found: ",
node["parameters"]["xpath"])
@ -552,7 +569,8 @@ class BrowserThread(Thread):
# do not forget to split!!
for path in node["parameters"]["pathList"].split("\n"):
try:
-element = self.browser.find_element(By.XPATH, path)
+element = self.browser.find_element(
+By.XPATH, path, iframe=node["parameters"]["iframe"])
for i in node["sequence"]: # run the operations one by one
self.executeNode(i, element, path, 0)
if self.browser.current_window_handle != thisHandle: # if the tab position changed after one loop iteration
@ -633,10 +651,11 @@ class BrowserThread(Thread):
self.executeNode(i, code, node["parameters"]["xpath"], 0)
self.history["index"] = thisHistoryLength
self.history["handle"] = self.browser.current_window_handle
+if node["parameters"]["iframe"]:
+self.browser.switch_to.default_content()
self.scrollDown(node["parameters"])
# open-page event
def openPage(self, para, loopValue):
time.sleep(1) # after opening the page, force a wait of at least 1 second
if len(self.browser.window_handles) > 1:
@ -677,19 +696,25 @@ class BrowserThread(Thread):
self.Log('Time out after set seconds when loading page: ' + url)
self.recordLog(
'Time out after set seconds when loading page: ' + url)
-self.browser.execute_script('window.stop()')
+try:
+self.browser.execute_script('window.stop()')
+except:
+pass
try:
self.history["index"] = self.browser.execute_script(
"return history.length")
except TimeoutException:
-self.browser.execute_script('window.stop()')
-self.history["index"] = self.browser.execute_script(
-"return history.length")
+try:
+self.browser.execute_script('window.stop()')
+self.history["index"] = self.browser.execute_script(
+"return history.length")
+except:
+self.history["index"] = 0
self.scrollDown(para) # scroll the screen down
if self.containJudge:
try:
self.bodyText = self.browser.find_element(
-By.CSS_SELECTOR, "body").text
+By.CSS_SELECTOR, "body", iframe=False).text
self.Log('URL Page: ' + url)
self.recordLog('URL Page: ' + url)
except TimeoutException:
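
The `history.length` handling in this hunk follows a "try, stop the page load, retry once, then fall back to 0" pattern. A minimal sketch of that pattern as a helper is shown here; the name `read_history_length` is hypothetical, and it catches `Exception` instead of Selenium's `TimeoutException` so that the example runs without Selenium installed.

```python
def read_history_length(browser):
    """Return history.length, retrying once after window.stop(); fall back to 0."""
    try:
        return browser.execute_script("return history.length")
    except Exception:
        try:
            # Stop the pending page load, then ask again.
            browser.execute_script("window.stop()")
            return browser.execute_script("return history.length")
        except Exception:
            # The browser is still unresponsive: use 0 as in the hunk above.
            return 0
```

Any object with an `execute_script` method works as `browser`, so the fallback path can be exercised with a stub that raises on its first calls.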
@ -702,7 +727,7 @@ class BrowserThread(Thread):
self.Log("Need to wait 1 second to get body text")
# run it once more
self.bodyText = self.browser.find_element(
-By.CSS_SELECTOR, "body").text
+By.CSS_SELECTOR, "body", iframe=False).text
except Exception as e:
self.Log(e)
self.recordLog(str(e))
@ -713,7 +738,8 @@ class BrowserThread(Thread):
time.sleep(0.1) # wait 0.1 seconds before input
self.Log("Wait 0.1 second before input")
try:
-textbox = self.browser.find_element(By.XPATH, para["xpath"])
+textbox = self.browser.find_element(
+By.XPATH, para["xpath"], iframe=para["iframe"])
# textbox.send_keys(Keys.CONTROL, 'a')
# textbox.send_keys(Keys.BACKSPACE)
self.execute_code(
@ -770,7 +796,8 @@ class BrowserThread(Thread):
self.browser.set_script_timeout(maxWaitTime)
# run a piece of JavaScript on the element before clicking
try:
-element = self.browser.find_element(By.XPATH, path)
+element = self.browser.find_element(
+By.XPATH, path, iframe=para["iframe"])
if para["beforeJS"] != "":
self.execute_code(2, para["beforeJS"],
para["beforeJSWaitTime"], element)
@ -786,7 +813,7 @@ class BrowserThread(Thread):
except:
click_way = 0
try:
-if click_way == 0: # use Selenium's click method
+if click_way == 0 or para["iframe"]: # use Selenium's click method
actions = ActionChains(self.browser) # instantiate an ActionChains object
actions.click(element).perform()
elif click_way == 1: # use the JavaScript click method
@ -804,7 +831,8 @@ class BrowserThread(Thread):
# run a piece of JavaScript on the element before clicking
try:
if para["afterJS"] != "":
-element = self.browser.find_element(By.XPATH, path)
+element = self.browser.find_element(
+By.XPATH, path, iframe=para["iframe"])
self.execute_code(2, para["afterJS"],
para["afterJSWaitTime"], element)
except:
@ -812,6 +840,8 @@ class BrowserThread(Thread):
self.recordLog("Cannot find element:" +
path + ", please try to set the wait time before executing this operation")
print("Cannot find the element to click: " + path + ", please try to set the wait time before executing this operation")
if para["iframe"]:
self.browser.switch_to.default_content()
waitTime = float(para["wait"]) + 0.01 # Wait after clicking
try:
waitType = int(para["waitType"])
@ -1071,13 +1101,13 @@ class BrowserThread(Thread):
p["relativeXPath"] + ")" + \
"[" + str(index + 1) + "]"
element = self.browser.find_element(
By.XPATH, full_path)
By.XPATH, full_path, iframe=p["iframe"])
else:
element = loopElement.find_element(By.XPATH,
p["relativeXPath"][1:])
else:
element = self.browser.find_element(
By.XPATH, p["relativeXPath"])
By.XPATH, p["relativeXPath"], iframe=p["iframe"])
except (NoSuchElementException, InvalidSelectorException, StaleElementReferenceException): # Fall back to the default value when the element is not found
# print(p)
try:
@ -1110,10 +1140,11 @@ class BrowserThread(Thread):
p["relativeXPath"][1:])
else:
element = self.browser.find_element(
By.XPATH, p["relativeXPath"])
By.XPATH, p["relativeXPath"], iframe=p["iframe"])
# rt.end()
else:
element = self.browser.find_element(By.XPATH, "//body")
element = self.browser.find_element(
By.XPATH, "//body", iframe=p["iframe"])
try:
self.execute_code(
2, p["beforeJS"], p["beforeJSWaitTime"], element) # Run the before-JS
@ -1135,7 +1166,7 @@ class BrowserThread(Thread):
'StaleElementReferenceException: loopElement+relativeXPath')
else:
element = self.browser.find_element(
By.XPATH, p["relativeXPath"])
By.XPATH, p["relativeXPath"], iframe=p["iframe"])
self.recordLog(
'StaleElementReferenceException: relativeXPath')
content = self.get_content(p, element)
@ -1327,7 +1358,7 @@ if __name__ == '__main__':
'mobileEmulation', {'deviceName': 'iPhone X'}) # Emulate browsing on an iPhone X
except:
pass
browser_t = webdriver.Chrome(
browser_t = MyChrome(
options=options, chrome_options=option, executable_path=driver_path)
elif cloudflare == 1:
browser_t = uc.Chrome(

+ 75
- 0
ExecuteStage/myChrome.py View File

@ -0,0 +1,75 @@
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException, InvalidSelectorException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import Select

desired_capabilities = DesiredCapabilities.CHROME
desired_capabilities["pageLoadStrategy"] = "none"


class MyChrome(webdriver.Chrome):
    def find_element(self, by=By.ID, value=None, iframe=False):
        # Override element lookup: when iframe is truthy, search inside each
        # top-level iframe instead of the main document.
        if not iframe:
            return super().find_element(by=by, value=value)
        # Start from the main document in case an earlier lookup left the
        # driver inside an iframe, then collect every iframe.
        self.switch_to.default_content()
        try:
            frames = super().find_elements(By.XPATH, "//iframe")
        except Exception as e:
            print(e)
            frames = []  # Otherwise the loop below would raise NameError
        for frame in frames:
            # Frame handles belong to the main document, so return there
            # before entering the next iframe.
            self.switch_to.default_content()
            self.switch_to.frame(frame)
            try:
                # On success the driver stays inside this iframe so the caller
                # can use the element; the caller switches back afterwards.
                return super().find_element(by=by, value=value)
            except NoSuchElementException:
                print("No such element found in the iframe")
        # Nothing matched in any iframe: restore the main document and report it.
        self.switch_to.default_content()
        raise NoSuchElementException

    def find_elements(self, by=By.ID, value=None, iframe=False):
        # Same override for multi-element lookup.
        if not iframe:
            return super().find_elements(by=by, value=value)
        self.switch_to.default_content()
        frames = super().find_elements(By.CSS_SELECTOR, "iframe")
        for frame in frames:
            self.switch_to.default_content()
            self.switch_to.frame(frame)
            # find_elements returns an empty list instead of raising, so an
            # empty result means "keep looking in the next iframe".
            elements = super().find_elements(by=by, value=value)
            if elements:
                return elements
        self.switch_to.default_content()
        raise NoSuchElementException
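
The frame-by-frame lookup that `MyChrome.find_element` performs can be tried out without a browser. The sketch below is only an illustration under stated assumptions: `FakeDriver`, `NoSuchElement`, `raw_find` and `find_in_frames` are hypothetical stand-ins and not part of Selenium or EasySpider. It also resets to the main document before entering each frame, which is a choice made for this sketch.

```python
class NoSuchElement(Exception):
    """Hypothetical stand-in for selenium's NoSuchElementException."""


class FakeDriver:
    """Hypothetical in-memory driver: one main document plus named iframes."""

    def __init__(self, main, frames):
        self.main = main        # selectors present in the main document
        self.frames = frames    # {frame name: selectors present in that frame}
        self.current = None     # None means the main document is active

    def switch_to_frame(self, name):
        self.current = name

    def switch_to_default_content(self):
        self.current = None

    def raw_find(self, value):
        # Look only in the currently active browsing context.
        scope = self.main if self.current is None else self.frames[self.current]
        if value not in scope:
            raise NoSuchElement(value)
        return value


def find_in_frames(driver, value, iframe=False):
    """Search the main document, or each iframe in turn when iframe is truthy."""
    if not iframe:
        return driver.raw_find(value)
    for name in driver.frames:
        # Go back to the main document before entering the next frame.
        driver.switch_to_default_content()
        driver.switch_to_frame(name)
        try:
            # On success the driver stays inside the matching frame.
            return driver.raw_find(value)
        except NoSuchElement:
            continue
    # No frame matched: leave the driver in the main document.
    driver.switch_to_default_content()
    raise NoSuchElement(value)


driver = FakeDriver(main={"//h1"}, frames={"ads": {"//img"}, "login": {"//input"}})
found = find_in_frames(driver, "//input", iframe=True)
print(found, driver.current)  # prints: //input login
```

After a successful lookup the driver is still inside the matching frame, which is why a caller has to switch back to the main document once it has finished with the element.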

+ 3
- 1
Readme.md View File

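@ -42,7 +42,7 @@ Bilibili video tutorials:
[Introduction to EasySpider - China Earthquake Networks Center scraping example](https://www.bilibili.com/video/BV1th411A7ey/)
[Automatic/manual matching of same-type elements](https://www.bilibili.com/video/BV1pu411a7pK/)
[Configuring page scroll-down](https://www.bilibili.com/video/BV1G14y1o7Qa/)
[How to scrape login-required websites visually without code - Zhihu example](https://www.bilibili.com/video/BV1HV4y1r7v8)
@ -70,6 +70,8 @@ Bilibili video tutorials:
[How to run multiple tasks at once (parallel instances)](https://www.bilibili.com/video/BV13c411G7LE/)
[Using the result of Python code as text-box input](https://www.bilibili.com/video/BV1kF411R7VJ/)
[Example - scraping articles from a user-hostile website and debugging code](https://www.bilibili.com/video/BV11W4y1D71t/)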
Refer to [Youtube Playlist](https://youtube.com/playlist?list=PL0kEFEkWrT7mt9MUlEBV2DTo1QsaanUTp) to see the video tutorials of EasySpider.
