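Sometimes, when you crawl a third-party site too often, the site's firewall intercepts the requests and blacklists your IP. That is when you need proxy IPs: when one gets blacklisted, switch to another.
Using a proxy with htmlunit is fairly simple: WebClient provides an overloaded constructor that takes the proxy host and port. Here is the demo code: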
package com.hbk.htmlunit;

import java.io.IOException;
import java.net.MalformedURLException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

/**
 * Fetches the source of http://www.3dns.com.cn through a proxy.
 *
 * @author 黄宝康
 */
public class HtmlUnitTest {
    public static void main(String[] args) {
        // Create the web client, passing the proxy host and port to the constructor
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45, "118.212.137.135", 31288);
        try {
            HtmlPage page = webClient.getPage("http://www.3dns.com.cn/");
            System.out.println("Page HTML: " + page.asXml()); // print the page as XML/HTML
        } catch (FailingHttpStatusCodeException e) {
            // The server answered with a failing HTTP status code
            e.printStackTrace();
        } catch (MalformedURLException e) {
            // The URL string is not valid
            e.printStackTrace();
        } catch (IOException e) {
            // Network error, e.g. the proxy is unreachable
            e.printStackTrace();
        } finally {
            webClient.close(); // close the client and release memory
        }
    }
}
The output is the same as without a proxy, only a bit slower, because the request now goes through the proxy.
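The "one gets blacklisted, switch to another" idea can be sketched with a small pool that hands out the current proxy and drops it once it stops working. This is only a minimal sketch in plain Java: `ProxyPool`, `Endpoint`, and `dropCurrentAndNext` are hypothetical names made up for illustration, not part of htmlunit, and the second address is a placeholder rather than a real proxy.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: keeps a list of proxies and moves on when one is blocked. */
final class ProxyPool {

    /** A host/port pair, i.e. the two values the WebClient constructor takes. */
    static final class Endpoint {
        final String host;
        final int port;

        Endpoint(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public String toString() {
            return host + ":" + port;
        }
    }

    private final List<Endpoint> endpoints = new ArrayList<>();
    private int index = 0;

    void add(String host, int port) {
        endpoints.add(new Endpoint(host, port));
    }

    int size() {
        return endpoints.size();
    }

    /** Returns the proxy currently in use. */
    Endpoint current() {
        if (endpoints.isEmpty()) {
            throw new IllegalStateException("no proxies left");
        }
        return endpoints.get(index);
    }

    /** Removes the current (blacklisted) proxy and returns the next one. */
    Endpoint dropCurrentAndNext() {
        if (endpoints.isEmpty()) {
            throw new IllegalStateException("no proxies left");
        }
        endpoints.remove(index);
        if (endpoints.isEmpty()) {
            throw new IllegalStateException("no proxies left");
        }
        index = index % endpoints.size(); // wrap around at the end of the list
        return endpoints.get(index);
    }

    public static void main(String[] args) {
        ProxyPool pool = new ProxyPool();
        pool.add("118.212.137.135", 31288); // the proxy used in the demo above
        pool.add("203.0.113.10", 8080);     // placeholder address, an assumption

        Endpoint p = pool.current();
        // Here you would build the client with p.host and p.port,
        // as in the WebClient constructor call of the demo.
        System.out.println("using " + p);

        // Pretend the request failed because the proxy was blacklisted.
        p = pool.dropCurrentAndNext();
        System.out.println("switched to " + p);
    }
}
```

In a real crawler you would presumably call `dropCurrentAndNext()` from the `catch` blocks of the demo and retry the request with a new `WebClient`; how to decide that a failure really means "blacklisted" depends on the target site.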
Where do you find proxy IPs? Plenty of websites offer them; here is one: http://www.66ip.cn
I am located in Jiangxi, so I pick a free proxy IP from the Jiangxi list.
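Once you have copied a few proxies from such a list, you need to split each one into a host and a port before passing them to WebClient. The sketch below assumes you have saved them as plain text, one `host:port` per line; that format is an assumption about how you store them, not a description of what the site returns. `ProxyListParser` is a hypothetical helper name.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns "host:port" lines into host/port pairs. */
final class ProxyListParser {

    /** Parses one "host:port" entry per line and silently skips malformed lines. */
    static List<InetSocketAddress> parse(String text) {
        List<InetSocketAddress> result = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            String entry = line.trim();
            int colon = entry.lastIndexOf(':');
            // Need a non-empty host before the colon and a port after it
            if (colon <= 0 || colon == entry.length() - 1) {
                continue;
            }
            try {
                int port = Integer.parseInt(entry.substring(colon + 1));
                if (port < 1 || port > 65535) {
                    continue; // not a valid TCP port
                }
                // createUnresolved stores host and port without a DNS lookup
                result.add(InetSocketAddress.createUnresolved(entry.substring(0, colon), port));
            } catch (NumberFormatException e) {
                // Port part is not a number: skip this line
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String saved = "118.212.137.135:31288\nthis line is not a proxy\n";
        for (InetSocketAddress proxy : parse(saved)) {
            // getHostString() and getPort() give the two WebClient constructor arguments
            System.out.println(proxy.getHostString() + " -> " + proxy.getPort());
        }
    }
}
```

Free proxies tend to go offline quickly, so it is probably worth testing each parsed entry with a real request before relying on it.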