Link snapshot platform
  • Enter a web page link and a snapshot is generated automatically
  • Manage web page links with tags

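Most pages today are rendered dynamically with HTML5 and JavaScript, and conventional fetching methods struggle with such pages. A few years ago PhantomJS appeared and could scrape dynamic pages effectively, but it has drawbacks: memory overflows and frequent hangs. Its author has now also stopped updating it.

So I have decided to drop PhantomJS!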

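A new discovery

Since v59, Chrome has shipped a headless mode. Combined with the Chrome DevTools Protocol, which exposes the browser engine's API, it allows distributed remote debugging of Chrome (data scraping and so on).

The Chrome DevTools Protocol allows tools to instrument, inspect, debug and profile Chromium, Chrome and other Blink-based browsers. Many existing projects currently use the protocol. Chrome DevTools itself uses this protocol, and the DevTools team maintains its API.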

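Server side: on a server with Chrome installed, start Chrome with remote debugging enabled

The commands below were run in a Docker environment (Alpine with Chrome). For more Chrome startup flags, see https://peter.sh/experiments/chromium-command-line-switches/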

chromium-browser --headless --no-sandbox --disable-gpu --remote-debugging-port=9222 
chrome --headless --no-sandbox --disable-gpu --remote-debugging-port=9222 --remote-debugging-address=0.0.0.0 --window-size=1920,1080 --user-data-dir=<some directory>
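
The launch flags above can also be assembled programmatically, for example when a supervisor script starts Chrome inside the container. The sketch below is only an illustration: the names `build_chrome_args` and `launch_chrome` and their defaults are assumptions, while every flag comes from the commands above.

```python
import subprocess


def build_chrome_args(binary="chromium-browser", port=9222, address=None,
                      window_size=None, user_data_dir=None):
    """Return an argv list for headless Chrome with remote debugging enabled."""
    args = [
        binary,
        "--headless",
        "--no-sandbox",   # commonly needed when Chrome runs as root in a container
        "--disable-gpu",
        f"--remote-debugging-port={port}",
    ]
    if address is not None:
        args.append(f"--remote-debugging-address={address}")
    if window_size is not None:
        width, height = window_size
        args.append(f"--window-size={width},{height}")
    if user_data_dir is not None:
        args.append(f"--user-data-dir={user_data_dir}")
    return args


def launch_chrome(**kwargs):
    """Start Chrome in the background; the caller must terminate the process."""
    return subprocess.Popen(build_chrome_args(**kwargs))
```

Passing `address="0.0.0.0"` makes the debugging port reachable from other machines, so it is probably best limited to trusted networks.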

Listing the open tabs (the /json endpoint of the debugging port) returns output like this:

[
  {
    "description": "",
    "devtoolsFrontendUrl": "/devtools/inspector.html?ws=192.168.110.128:9444/devtools/page/(9E4790959AAB0C8FB8F309ABB204729C)",
    "id": "(9E4790959AAB0C8FB8F309ABB204729C)",
    "title": "百度一下,你就知道",
    "type": "page",
    "url": "https://www.baidu.com/",
    "webSocketDebuggerUrl": "ws://192.168.110.128:9444/devtools/page/(9E4790959AAB0C8FB8F309ABB204729C)"
  },
  {
    "description": "",
    "devtoolsFrontendUrl": "/devtools/inspector.html?ws=192.168.110.128:9444/devtools/page/(C8A6E4D304F820AC9F48AC9A34137F78)",
    "id": "(C8A6E4D304F820AC9F48AC9A34137F78)",
    "title": "百度一下,你就知道",
    "type": "page",
    "url": "https://www.baidu.com/",
    "webSocketDebuggerUrl": "ws://192.168.110.128:9444/devtools/page/(C8A6E4D304F820AC9F48AC9A34137F78)"
  },
  {
    "description": "",
    "devtoolsFrontendUrl": "/devtools/inspector.html?ws=192.168.110.128:9444/devtools/page/(E18749BAD4802F598A844A7EE14BA9C4)",
    "id": "(E18749BAD4802F598A844A7EE14BA9C4)",
    "title": "about:blank",
    "type": "page",
    "url": "about:blank",
    "webSocketDebuggerUrl": "ws://192.168.110.128:9444/devtools/page/(E18749BAD4802F598A844A7EE14BA9C4)"
  },
  {
    "description": "",
    "devtoolsFrontendUrl": "/devtools/inspector.html?ws=192.168.110.128:9444/devtools/page/(2C5CCAACD2BFBA9E39D73EBAB2291C87)",
    "id": "(2C5CCAACD2BFBA9E39D73EBAB2291C87)",
    "title": "",
    "type": "page",
    "url": "file:///",
    "webSocketDebuggerUrl": "ws://192.168.110.128:9444/devtools/page/(2C5CCAACD2BFBA9E39D73EBAB2291C87)"
  }
]
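
A client usually needs only the `webSocketDebuggerUrl` of each page from a listing like the one above. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the `/json` path and the default base URL are assumptions based on the endpoints used in this article.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9222"):
    """Fetch the list of open targets from the debugging port (assumed /json path)."""
    with urlopen(f"{base_url}/json") as response:
        return json.loads(response.read().decode("utf-8"))


def page_ws_urls(targets):
    """Map each page target's id to its webSocketDebuggerUrl."""
    return {
        target["id"]: target["webSocketDebuggerUrl"]
        for target in targets
        # Skip non-page targets and entries that expose no WebSocket URL.
        if target.get("type") == "page" and "webSocketDebuggerUrl" in target
    }
```

`page_ws_urls(fetch_targets())` would then give one WebSocket address per open tab, ready to be handed to a WebSocket client.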

Open a new tab

http://localhost:9222/json/new
http://localhost:9222/json/new?http://www.baidu.com

Close a tab

http://localhost:9222/json/close/477810FF-323E-44C5-997C-89B7FAC7B158

Activate a tab

http://localhost:9222/json/activate/477810FF-323E-44C5-997C-89B7FAC7B158

View version information

http://localhost:9222/json/version
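
The HTTP endpoints above can be wrapped in a small helper. This is a sketch under assumptions: `tab_request` and `send` are hypothetical names, and the `method` parameter is there because newer Chrome releases are reported to reject GET on `/json/new` and expect PUT instead, so check the behavior of your Chrome version.

```python
from urllib.request import Request, urlopen


def tab_request(action, argument="", base_url="http://localhost:9222", method="GET"):
    """Build a request for /json/new, /json/close/<id>, /json/activate/<id> or /json/version."""
    if action == "new":
        # The page to open is passed as the raw query string: /json/new?<url>
        path = "/json/new" + (f"?{argument}" if argument else "")
    elif action in ("close", "activate"):
        path = f"/json/{action}/{argument}"
    elif action == "version":
        path = "/json/version"
    else:
        raise ValueError(f"unsupported action: {action}")
    return Request(base_url + path, method=method)


def send(request):
    """Send the request and return the response body as text."""
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

For example, `send(tab_request("new", "http://www.baidu.com"))` would open a tab, and `send(tab_request("close", tab_id))` would close the tab whose id is `tab_id`.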

Client side: connect to the Chrome remote debugging port over the WebSocket protocol

ws://192.168.110.128:9444/devtools/page/(9E4790959AAB0C8FB8F309ABB204729C)
  

Send the following API commands over the connection

# Open a page
{"id":200,"method":"Page.navigate","params":{"url":"https://www.baidu.com"}}
# Get the DOM
{"id":200,"method":"DOM.getDocument"}
# Get the HTML
{"id":200,"method":"DOM.getOuterHTML","params":{"nodeId":1,"backendNodeId":12}}
# Get the resource tree
{"id":200,"method":"Page.getResourceTree","params":{}}
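
Every command above reuses id 200. The protocol echoes the id back in the reply, so a client that sends several commands will probably want a fresh id per command and a way to skip event notifications that arrive in between. The sketch below shows one way to do that; `CdpSession` is a hypothetical name, and the connection is any object with `send(text)` and `recv()` methods (the third-party websocket-client package's `create_connection` is assumed to return such an object, but it is not imported here).

```python
import itertools
import json


class CdpSession:
    """Send DevTools Protocol commands over an already-open WebSocket-like connection."""

    def __init__(self, connection):
        self._connection = connection
        self._ids = itertools.count(1)   # a fresh id for every command
        self.events = []                 # messages received while waiting for a reply

    def call(self, method, params=None):
        message_id = next(self._ids)
        message = {"id": message_id, "method": method}
        if params:
            message["params"] = params
        self._connection.send(json.dumps(message))
        while True:
            reply = json.loads(self._connection.recv())
            if reply.get("id") == message_id:
                if "error" in reply:
                    raise RuntimeError(reply["error"].get("message", "protocol error"))
                return reply.get("result", {})
            # Anything without the matching id is an event or a stale reply.
            self.events.append(reply)
```

With a live connection, `session.call("Page.navigate", {"url": "https://www.baidu.com"})` followed by `session.call("DOM.getDocument")` would mirror the first two commands. The protocol documentation lists `nodeId`, `backendNodeId` and `objectId` as alternative ways to identify a node for `DOM.getOuterHTML`, so passing a single one of them is usually enough.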

JavaScript can be executed through the API (Runtime.evaluate), much like typing an expression into the Chrome console

{"id":200,"method":"Runtime.evaluate","params":{"expression":"document.title","objectGroup":"console","includeCommandLineAPI":true,"silent":false,"contextId":1,"returnByValue":false,"generatePreview":true,"userGesture":true,"awaitPromise":false}}
{"id":200,"method":"Runtime.evaluate","params":{"expression":"document.title","objectGroup":"console","includeCommandLineAPI":true,"silent":false,"returnByValue":false,"generatePreview":true,"userGesture":true,"awaitPromise":false}}

Response:

{
    "id": 200,
    "result": {
        "result": {
            "type": "string",
            "value": "百度一下,你就知道"
        }
    }
}
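
A reply to `Runtime.evaluate` nests the evaluated value two levels deep, and failures show up in different places: a top-level `error` for protocol problems and `exceptionDetails` when the script itself throws. The helper below is a sketch of how a client might unwrap such a reply; `evaluate_value` is a hypothetical name, and it only handles values that are returned inline.

```python
import json


def evaluate_value(raw_reply):
    """Return the inline value from a Runtime.evaluate reply, or raise on failure."""
    reply = json.loads(raw_reply)
    if "error" in reply:
        raise RuntimeError(f"protocol error: {reply['error'].get('message')}")
    result = reply["result"]
    if "exceptionDetails" in result:
        raise RuntimeError(f"script threw: {result['exceptionDetails'].get('text')}")
    remote_object = result["result"]
    if "value" not in remote_object:
        # Objects are returned as handles unless returnByValue is true;
        # undefined also carries no value field.
        raise ValueError(f"no inline value for type {remote_object.get('type')}")
    return remote_object["value"]
```

Applied to the reply shown above, this returns the page title string.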

Extension API

Many extensions use this protocol to interact with and debug pages. The official site has many Sample Extensions

https://developer.chrome.com/extensions/samples#search:debugger

Chrome API

https://chromedevtools.github.io/devtools-protocol/

API: simulating keyboard input

https://chromedevtools.github.io/devtools-protocol/tot/Input/

Chrome startup flags

https://peter.sh/experiments/chromium-command-line-switches/

Some interesting tools

https://developer.chrome.com/devtools/docs/debugging-clients

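Many tools use the Chrome debugging protocol, including PhantomJS and Selenium's ChromeDriver. They are essentially the same kind of implementation: the protocol acts as an API exposed by the Chrome engine for applications to call.

The official site lists many interesting tools (see the link above). Because the API is so rich, there are this many Chrome extensions.

Node libraries that implement the remote debugging protocol:

  • chrome-debug-protocol, written with ES6 and TypeScript: https://github.com/DickvdBrink/chrome-debug-protocol
  • chrome-remote-interface, the one recommended by the official site: https://github.com/cyrus-and/chrome-remote-interface
  • chrome-har-capturer, takes a URL and returns a HAR-format file directly: https://github.com/cyrus-and/chrome-har-capturer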

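What is WebDriver

WebDriver is an open source tool for automated testing of web applications across many browsers. It provides capabilities for navigating to web pages, user input, JavaScript execution, and more. The WebDriver W3C standard: https://w3c.github.io/webdriver/webdriver-spec.html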

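What is ChromeDriver

ChromeDriver is a standalone server that implements WebDriver's wire protocol for Chromium. It is in the process of implementing and moving to the W3C standard. ChromeDriver is available for Chrome on Android and Chrome on desktop (Mac, Linux, Windows and ChromeOS).

W3C standard features that ChromeDriver has already implemented: https://chromium.googlesource.com/chromium/src/+/master/docs/chromedriver_status.md

ChromeDriver is maintained by the Chromium team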
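
Under the W3C standard, a client starts a browser by POSTing a JSON "New Session" request to the driver. The sketch below builds such a request with the standard library only. It assumes a ChromeDriver server listening on 127.0.0.1:9515 (its usual default port) and uses the `goog:chromeOptions` capability to pass Chrome flags; the function name is hypothetical.

```python
import json
from urllib.request import Request


def new_session_request(server_url="http://127.0.0.1:9515", chrome_args=("--headless",)):
    """Build the W3C WebDriver 'New Session' request for a ChromeDriver server."""
    payload = {
        "capabilities": {
            "alwaysMatch": {
                "browserName": "chrome",
                "goog:chromeOptions": {"args": list(chrome_args)},
            }
        }
    }
    return Request(
        f"{server_url}/session",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
```

Sending this request with `urllib.request.urlopen` should return a JSON body containing the new session id, which later commands include in their URLs. Client libraries such as Selenium do this work for you.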

Driving ChromeDriver with Selenium

import time
# Import webdriver
from selenium import webdriver
# Point to the location of the chromedriver binary
driver = webdriver.Chrome('/path/to/chromedriver')  # Optional argument, if not specified will search path.
driver.get('http://www.google.com/xhtml')
time.sleep(5) # Let the user actually see something!
# Note: Selenium 4 removed find_element_by_name; use driver.find_element(By.NAME, 'q') there.
search_box = driver.find_element_by_name('q')
search_box.send_keys('ChromeDriver')
search_box.submit()
time.sleep(5) # Let the user actually see something!
driver.quit()

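Controlling ChromeDriver's lifetime

The ChromeDriver class starts the ChromeDriver server process when it is created and terminates it when quit is called. This can waste a significant amount of time in large test suites where every test creates a ChromeDriver instance.

There are two ways to address this: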

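  1. Use the ChromeDriverService. This is available for most languages and allows you to start/stop the ChromeDriver server yourself. A Java example (with JUnit 4):

    import java.io.File;
    import java.io.IOException;
    import junit.framework.TestCase;
    import org.junit.After;
    import org.junit.AfterClass;
    import org.junit.Before;
    import org.junit.BeforeClass;
    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.junit.runners.BlockJUnit4ClassRunner;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriverService;
    import org.openqa.selenium.remote.DesiredCapabilities;
    import org.openqa.selenium.remote.RemoteWebDriver;

    @RunWith(BlockJUnit4ClassRunner.class)
    public class ChromeTest extends TestCase {

      private static ChromeDriverService service;
      private WebDriver driver;

      // Start the ChromeDriver server once for the whole test class.
      @BeforeClass
      public static void createAndStartService() throws IOException {
        service = new ChromeDriverService.Builder()
            .usingDriverExecutable(new File("path/to/my/chromedriver"))
            .usingAnyFreePort()
            .build();
        service.start();
      }

      // Stop the server after the last test has run.
      @AfterClass
      public static void createAndStopService() {
        service.stop();
      }

      // Each test gets its own browser session on the shared server.
      @Before
      public void createDriver() {
        driver = new RemoteWebDriver(service.getUrl(), DesiredCapabilities.chrome());
      }

      @After
      public void quitDriver() {
        driver.quit();
      }

      @Test
      public void testGoogleSearch() {
        driver.get("http://www.google.com");
        // rest of the test…
      }
    }

    Python:

    import time

    from selenium import webdriver
    import selenium.webdriver.chrome.service as service

    # Use a separate name so the imported module is not shadowed.
    chrome_service = service.Service('/path/to/chromedriver')
    chrome_service.start()
    capabilities = {'chrome.binary': '/path/to/custom/chrome'}
    driver = webdriver.Remote(chrome_service.service_url, capabilities)
    driver.get('http://www.google.com/xhtml')
    time.sleep(5) # Let the user actually see something!
    driver.quit()

  2. Start the ChromeDriver server separately before running your tests, and connect to it using the Remote WebDriver. Terminal:

    $ ./chromedriver
    Started ChromeDriver port=9515 version=14.0.836.0

    Java:

    // RemoteWebDriver expects a java.net.URL; new URL(...) can throw MalformedURLException.
    WebDriver driver = new RemoteWebDriver(new URL("http://127.0.0.1:9515"), DesiredCapabilities.chrome());
    driver.get("http://www.google.com");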
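
The first approach, one server per test class and one driver per test, maps onto Python's `unittest` through `setUpClass` and `tearDownClass`. The sketch below shows only the lifecycle pattern: `RecordingService` is a hypothetical stand-in that counts calls, so the example runs without Chrome. In a real suite it would be replaced by a ChromeDriver service object, and `setUp` would create the Remote driver.

```python
import io
import unittest


class RecordingService:
    """Hypothetical stand-in for a ChromeDriver service; it only counts calls."""

    starts = 0
    stops = 0

    def start(self):
        type(self).starts += 1

    def stop(self):
        type(self).stops += 1


class SharedServiceTest(unittest.TestCase):
    service_class = RecordingService  # swap in a real service class in practice

    @classmethod
    def setUpClass(cls):
        # Runs once for the whole class, like JUnit's @BeforeClass.
        cls.service = cls.service_class()
        cls.service.start()

    @classmethod
    def tearDownClass(cls):
        # Runs once after the last test, like JUnit's @AfterClass.
        cls.service.stop()

    def setUp(self):
        # Per-test state; a real suite would create the Remote driver here.
        self.driver = object()

    def test_first(self):
        self.assertIsNotNone(self.driver)

    def test_second(self):
        self.assertIsNotNone(self.driver)


def run_shared_service_suite():
    """Run the two tests quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SharedServiceTest)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

Running the suite starts and stops the service once even though two tests execute, which is the saving the section describes.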

https://div.io/topic/1464
https://sites.google.com/a/chromium.org/chromedriver/
https://github.com/SeleniumHQ/selenium/wiki/JsonWireProtocol
https://github.com/seleniumhq/selenium
https://sites.google.com/a/chromium.org/chromedriver/getting-started
https://github.com/SeleniumHQ/selenium/wiki/DesiredCapabilities.md
https://sites.google.com/a/chromium.org/chromedriver/capabilities
http://peter.sh/examples/?/chromium-switches.html