Reading an .html file in Java and extracting its data

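      // Imports needed at the top of the file (jsoup must be on the classpath):
      // import java.io.File;
      // import org.jsoup.Jsoup;
      // import org.jsoup.nodes.Document;
      // import org.jsoup.nodes.Element;
      // import org.jsoup.select.Elements;

      // The statements below go inside a method such as main.
      String filePath = "D:\\工作文档\\国民经济行业分类\\报告_test.html";
      // Read the .html file into a string
      String htmlStr = toHtmlString(new File(filePath));
      // Parse the string into a Document
      Document doc = Jsoup.parse(htmlStr);
      // Within body, find the elements with class="fc" (the table)
      Elements table = doc.body().getElementsByClass("fc");
      // Get the table's child elements (the tbody, which jsoup adds when the markup omits it)
      Elements children = table.first().children();
      // Collect the tr elements under the tbody
      Elements tr = children.get(0).getElementsByTag("tr");
      // Walk each tr, fetch its td cells, and print one row per line
      for (int i = 0; i < tr.size(); i++) {
          Element e1 = tr.get(i);
          Elements td = e1.getElementsByTag("td");
          for (int j = 0; j < td.size(); j++) {
              String value = td.get(j).text();
              System.out.print("  " + value);
          }
          System.out.println();
      }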
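// Helper used above; it needs these imports at the top of the file:
// import java.io.BufferedReader;
// import java.io.File;
// import java.io.FileInputStream;
// import java.io.FileNotFoundException;
// import java.io.IOException;
// import java.io.InputStreamReader;

/**
 * Reads the HTML source of a local .html file.
 *
 * @param file the HTML file to read
 * @return the file content as a string (empty if the file cannot be read)
 */
public static String toHtmlString(File file) {
    StringBuilder htmlSb = new StringBuilder();
    // In the JDK, "unicode" resolves to UTF-16, matching the test file's declared charset.
    // Change it if your file uses another encoding such as UTF-8.
    try (BufferedReader br = new BufferedReader(new InputStreamReader(
            new FileInputStream(file), "unicode"))) {
        String line;
        // readLine() returns null at end of stream, which is more reliable than ready()
        while ((line = br.readLine()) != null) {
            // readLine() drops the line terminator, so add one back
            htmlSb.append(line).append('\n');
        }
        // Delete the temporary file
        //file.delete();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Return the raw HTML text (no cleaning is applied)
    return htmlSb.toString();
}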
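The reader loop above can also be replaced by a single call into java.nio.file. The following is a minimal sketch of that alternative, assuming the file is small enough to hold in memory. The class name HtmlFileReader and the method readHtml are made up for illustration and are not part of the original code; UTF-16 is used only because the test file declares charset=unicode.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original article.
class HtmlFileReader {

    // Reads the whole file and decodes it with the given charset.
    // Unlike a readLine() loop, line breaks are kept exactly as stored.
    static String readHtml(Path path, Charset charset) throws IOException {
        return new String(Files.readAllBytes(path), charset);
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-16 file to a temp location, then read it back.
        Path tmp = Files.createTempFile("report_test", ".html");
        try {
            String html = "<table class=\"fc\"><tr><td>A0111</td></tr></table>";
            Files.write(tmp, html.getBytes(StandardCharsets.UTF_16));
            System.out.println(readHtml(tmp, StandardCharsets.UTF_16));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The returned string can be passed to Jsoup.parse in the same way as the result of toHtmlString.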
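Document properties and methods (the browser's JavaScript document object)

1. Object properties

1. document.title // gets or sets the document title (the HTML title tag)
2. document.bgColor // page background color (deprecated)
3. document.fgColor // foreground (text) color (deprecated)
4. document.URL // URL of the current document (read-only in current browsers; assign location.href to open another page in the same window)
5. document.linkColor // color of unvisited links (deprecated)
6. document.cookie // reads and writes cookies
7. document.fileSize // file size (read-only; Internet Explorer only)
8. document.charset // character set of the document
9. document.alinkColor // color of the active link, i.e. while the link has focus (deprecated)
10. document.vlinkColor // color of visited links (deprecated)
11. document.fileCreatedDate // file creation date (read-only; Internet Explorer only)
12. document.fileModifiedDate // file modification date (read-only; Internet Explorer only)
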
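2. Commonly used object methods

1. document.write() // writes content into the page dynamically
2. document.createElement(Tag) // creates an HTML element object
3. document.getElementById(ID) // returns the element with the given id
4. document.getElementsByClassName(className) // returns the elements with the given class (an array-like collection)
5. document.getElementsByTagName(TagName) // returns the elements with the given tag name
6. document.body.appendChild(Tag) // appends a newly created element to body
7. document.getElementsByName(Name) // returns the elements with the given name attribute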
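
The properties and methods listed above belong to the browser's JavaScript document object. The Java code earlier in this article works with jsoup's org.jsoup.nodes.Document instead, which offers similarly named calls. The sketch below shows rough jsoup counterparts; the class name JsoupDomDemo, the method summarize, and the sample markup are invented for illustration, and the jsoup library is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical demo class; requires the jsoup library on the classpath.
class JsoupDomDemo {

    // Builds a short summary using jsoup counterparts of the DOM calls listed above.
    static String summarize(String html) {
        Document doc = Jsoup.parse(html);

        String title = doc.title();                       // document.title
        Element byId = doc.getElementById("main");        // document.getElementById
        Elements byClass = doc.getElementsByClass("fc");  // document.getElementsByClassName
        Elements byTag = doc.getElementsByTag("td");      // document.getElementsByTagName
        // jsoup has no getElementsByName; an attribute lookup plays that role
        Elements byName = doc.getElementsByAttributeValue("name", "q");

        // document.createElement + document.body.appendChild
        Element p = doc.createElement("p").text("added");
        doc.body().appendChild(p);

        return title + "|" + (byId != null) + "|" + byClass.size() + "|"
                + byTag.size() + "|" + byName.size() + "|"
                + doc.body().children().last().text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>demo</title></head><body>"
                + "<div id=\"main\"><input name=\"q\"></div>"
                + "<table class=\"fc\"><tr><td>A0111</td><td>B</td></tr></table>"
                + "</body></html>";
        System.out.println(summarize(html));
    }
}
```

Note that jsoup parses a static copy of the markup: changes such as the appended p element affect only the in-memory Document, not any page in a browser.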
HTML test file
 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=unicode">
<style>
.AlignLeft { text-align: left; }
.AlignCenter { text-align: center; }
.AlignRight { text-align: right; }
body { font-family: sans-serif; font-size: 11pt; }
td { vertical-align: top; padding-left: 4px; padding-right: 4px; }
tr.SectionGap td { font-size: 4px; border-left: none; border-top: none; border-bottom: 1px solid Black; border-right: 1px solid Black; }
tr.SectionAll td { border-left: none; border-top: none; border-bottom: 1px solid Black; border-right: 1px solid Black; }
tr.SectionBegin td { border-left: none; border-top: none; border-right: 1px solid Black; }
tr.SectionEnd td { border-left: none; border-top: none; border-bottom: 1px solid Black; border-right: 1px solid Black; }
tr.SectionMiddle td { border-left: none; border-top: none; border-right: 1px solid Black; }
tr.SubsectionAll td { border-left: none; border-top: none; border-bottom: 1px solid Gray; border-right: 1px solid Black; }
tr.SubsectionEnd td { border-left: none; border-top: none; border-bottom: 1px solid Gray; border-right: 1px solid Black; }
table.fc { border-top: 1px solid Black; border-left: 1px solid Black; width: 100%; font-family: monospace; font-size: 10pt; }
td.DataItemHeader { color: #000000; background-color: #FFFFFF; background-color: #E7E7E7; padding-top: 8px; }
td.DataItemInsigDiff { color: #000000; background-color: #EEEEFF; }
td.DataItemInsigOrphan { color: #000000; background-color: #FAEEFF; }
td.DataItemNum { color: #696969; background-color: #F0F0F0; }
td.DataItemSame { color: #000000; background-color: #FFFFFF; }
td.DataItemSigDiff { color: #000000; background-color: #FFE3E3; }
td.DataItemSigOrphan { color: #000000; background-color: #F1E3FF; }
.DataSegInsigDiff { color: #0000FF; }
.DataSegSigDiff { color: #FF0000; }
</style>
<title>国民经济行业分类比较</title>
</head>
<body>
国民经济行业分类比较<br/>
已产生: 2022/2/16 14:00:55<br/>
&nbsp; &nbsp;
模式:&nbsp; 全部 &nbsp;
左边文件: D:\工作文档\国民经济行业分类\Eleasing-国民经济行业分类 (2).xlsx &nbsp;
右边文件: D:\工作文档\国民经济行业分类\人行-国民经济分类.xlsx &nbsp;
<table class="fc" cellspacing="0" cellpadding="0">
<tr class="SectionAll">
<td class="DataItemHeader">1:</td>
<td class="DataItemHeader">2:</td>
<td class="DataItemHeader">&nbsp;</td>
<td class="DataItemHeader">1:</td>
<td class="DataItemHeader">2:</td>
</tr>
<tr class="SectionMiddle">
<td class="DataItemSame AlignLeft">A0111</td>
<td class="DataItemSame AlignLeft">稻谷种植</td>
<td class="AlignCenter">=</td>
<td class="DataItemSame AlignLeft">A0111</td>
<td class="DataItemSame AlignLeft">稻谷种植</td>
</tr>
<tr class="SectionMiddle">
<td class="DataItemSame AlignLeft">A0112</td>
<td class="DataItemSame AlignLeft">小麦种植</td>
<td class="AlignCenter">=</td>
<td class="DataItemSame AlignLeft">A0112</td>
<td class="DataItemSame AlignLeft">小麦种植</td>
</tr>
<tr class="SectionMiddle">
<td class="DataItemSigOrphan AlignLeft"><span class="DataSegSigDiff">T9600</span></td>
<td class="DataItemSigOrphan AlignLeft"><span class="DataSegSigDiff">国际组织</span></td>
<td class="AlignCenter">+-</td>
<td class="DataItemSame AlignLeft">&nbsp;</td>
<td class="DataItemSame AlignLeft">&nbsp;</td>
</tr>
<tr class="SectionMiddle">
<td class="DataItemSame AlignLeft">&nbsp;</td>
<td class="DataItemSame AlignLeft">&nbsp;</td>
<td class="AlignCenter">-+</td>
<td class="DataItemSigOrphan AlignLeft"><span class="DataSegSigDiff">T9700</span></td>
<td class="DataItemSigOrphan AlignLeft"><span class="DataSegSigDiff">国际组织</span></td>
</tr>
<tr class="SectionEnd">
<td class="DataItemSame AlignLeft">代码</td>
<td class="DataItemSame AlignLeft">中文名称</td>
<td class="AlignCenter">=</td>
<td class="DataItemSame AlignLeft">代码</td>
<td class="DataItemSame AlignLeft">中文名称</td>
</tr>
</table>
</body>
</html>
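
For a report shaped like the test file above, the rows can also be collected into lists with a CSS selector and then filtered. The sketch below is one possible way to do that, not the original author's method. The class ReportRows and its two methods are hypothetical names, jsoup is assumed to be on the classpath, and the reading of the third cell as a comparison marker ("=" for identical rows, anything else for a difference) is an assumption based on the sample rows.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper; requires the jsoup library on the classpath.
class ReportRows {

    // Returns one list of cell texts per <tr> of the table with class "fc".
    static List<List<String>> extractRows(String html) {
        Document doc = Jsoup.parse(html);
        List<List<String>> rows = new ArrayList<>();
        for (Element tr : doc.select("table.fc tr")) {
            List<String> cells = new ArrayList<>();
            for (Element td : tr.select("td")) {
                // Depending on the jsoup version, &nbsp; may survive as U+00A0,
                // which trim() does not remove, so map it to a plain space first.
                cells.add(td.text().replace('\u00a0', ' ').trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    // Assumption: the third cell holds the comparison marker and "=" means "same".
    // Rows whose third cell is empty (such as the header row) are skipped.
    static List<List<String>> differingRows(List<List<String>> rows) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> row : rows) {
            if (row.size() > 2 && !row.get(2).isEmpty() && !"=".equals(row.get(2))) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<table class=\"fc\">"
                + "<tr><td>A0111</td><td>稻谷种植</td><td>=</td><td>A0111</td><td>稻谷种植</td></tr>"
                + "<tr><td>T9600</td><td>国际组织</td><td>+-</td><td>&nbsp;</td><td>&nbsp;</td></tr>"
                + "</table>";
        for (List<String> row : differingRows(extractRows(html))) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Keeping extraction and filtering in separate methods means the filter can be changed, for example to match only "+-" rows, without touching the parsing code.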
References:
 Reading local HTML text in Java
 Common properties and methods of the document object