Extracting nested tags with regular expressions (Regex) - Julia Beginners

catch me is error

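How can I use a regex to extract the content of this <div class="goal">.....</div> tag (which contains several nested tags)?

How would this be done in Julia v1.1.1? An explanation would be even better, thanks.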

<div class=\"goal\"[^>]*>[^<>]*(((?'d'<div[^>]*>)[^<>]*)+((?'-d'</div>)[^<>]*)+)*(?(d)(?!))</div>

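The expression above is a lightly modified version of one I found by searching.

Using an HTML parser is probably the quickest solution.

But the next time a similar problem comes up with something other than HTML tags, an HTML parser won't be an option, right? _(:3

Matching nested structures with regex carries over across languages, so it feels like the place where I would actually learn something.

I also saw the question you raised earlier about escaping in Julia. Would the expression above run into escaping or other problems when used in Julia?

If that expression is completely wrong, could I trouble you to write a working one?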

Regex Tutorial - Backreferences To Match The Same Text Again

But the example it gives doesn't seem to work well either. On RegExr: Learn, Build, & Test RegEx it only matches up to the nearest closing tag.
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1> doesn't seem to work either.

I'm not very good with regex.

Regex really isn't suited to this. It's fine for simple text matching, but for paired tags that can also nest recursively, a parser is the better choice.

I wrote an entire blog entry on this subject: Regular Expression Limitations

The crux of the issue is that HTML and XML are recursive structures which require counting mechanisms in order to properly parse. A true regex is not capable of counting. You must have a context free grammar in order to count.

Can you provide some examples of why it is hard to parse XML and HTML with a regex? - Stack Overflow
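The "counting" the quote refers to can be done by hand for delimiters that are not HTML at all. Below is a minimal sketch, not a full parser: `first_balanced` is a hypothetical helper name, and it assumes the opening and closing tokens are plain strings that never appear inside quoted text or comments.

```julia
# Return the first balanced opener...closer block in `text` as a SubString,
# or `nothing` if there is no opener or the block is never closed.
# `first_balanced` is an illustrative name, not a library function.
function first_balanced(text::AbstractString, opener::AbstractString, closer::AbstractString)
    start = findfirst(opener, text)
    start === nothing && return nothing
    depth = 0                      # number of currently open blocks
    pos = first(start)
    while true
        o = findnext(opener, text, pos)
        c = findnext(closer, text, pos)
        c === nothing && return nothing          # ran out of closers: unbalanced
        if o !== nothing && first(o) < first(c)
            depth += 1                           # an opener comes first: go one level deeper
            pos = nextind(text, last(o))
        else
            depth -= 1                           # a closer comes first: leave one level
            pos = nextind(text, last(c))
            depth == 0 && return SubString(text, first(start), last(c))
        end
    end
end

println(first_balanced("a [b [c] d] e [f]", "[", "]"))
println(first_balanced("x<div class=\"goal\"><div>a</div><div>b</div></div>y", "<div", "</div>"))
```

The same loop works for brackets, `begin`/`end` style keywords, or tags, which is the case an HTML parser would not cover. Matching on the bare string `<div` is a simplification: it would also count a tag such as `<divider>`, so real input may need a stricter opener.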

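Of course, the regex engines in use today have extra features added to satisfy all sorts of odd requirements (some people even consider PCRE Turing-complete).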
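One of those extras is recursion. Julia's `Regex` is built on PCRE, which lets a named group call itself with `(?&name)`. The `(?'-d'...)` balancing-group syntax in the expression quoted earlier is a .NET feature and, as far as I know, PCRE does not accept it, so that expression would need rewriting rather than just escaping. The following is a sketch under simplifying assumptions: the sample `html` is made up for illustration, attribute values contain no `>`, and comments or scripts containing `<div` are not handled.

```julia
# Text that is neither the start of a <div ...> tag nor of a </div> tag.
const FILLER = raw"(?:[^<]++|<(?!/?div\b))"

# A balanced <div>...</div>; (?&div) recurses into this same named group.
const NESTED = raw"(?<div><div\b[^>]*+>(?:" * FILLER * raw"|(?&div))*+</div>)"

# The outer <div class="goal" ...>; capture group 1 is everything inside it.
const GOAL = Regex(raw"<div\s+class=\"goal\"[^>]*+>((?:" * FILLER * "|" * NESTED * raw")*+)</div>")

# Made-up sample input, only for this illustration.
html = """
<p>before</p>
<div class="goal" not_only_these_elements>
  <div>something here <div>anything here</div></div>
  <div>nothing here</div>
</div>
<div>after</div>
"""

m = match(GOAL, html)
println(m.match)   # the whole goal div, including its nested divs
println(m[1])      # only the inner content
```

The pieces are written with `raw"..."` so that backslashes such as `\b` and `\s` reach PCRE unchanged; only the double quote needs escaping. The possessive quantifiers (`++`, `*+`) are there to keep the engine from backtracking heavily on malformed input. For real HTML, a parser remains the safer route.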

using Gumbo, Cascadia  # assumed imports: parsehtml comes from Gumbo, sel"..." from Cascadia
# `n` is assumed to be the parsed document, e.g. n = parsehtml(html)

# Select the div with class "goal"
@show eachmatch(sel"div.goal", n.root)
# 1-element Array{HTMLNode,1}:
#  HTMLElement{:div}:
# <div class="goal"not_only_these_elements="">
#   <div>
#     something here
#     <div>
#       anything here
#     </div>
#   </div>
#   <div>
#     nothing here
#   </div>
# </div>

# Select only the direct child divs of div.goal
@show eachmatch(sel"div.goal > div", n.root)
# 2-element Array{HTMLNode,1}:
#  HTMLElement{:div}:
# <div>
#   something here
#   <div>
#     anything here
#   </div>
# </div>
#  HTMLElement{:div}:
# <div>
#   nothing here
# </div>

# Children of the second direct child
@show eachmatch(sel"div.goal > div", n.root)[2].children
# 1-element Array{HTMLNode,1}:
#  HTML Text: nothing here

# Print the text node itself
print(eachmatch(sel"div.goal > div", n.root)[2].children[1])
# nothing here