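I am doing the following with a list of words. I read lines from a Project Gutenberg text file, split each line on spaces, perform general punctuation substitution, and then print each word and punctuation tag on its own line for further processing later. I am unsure how to replace every single quote with a tag while leaving all apostrophes alone. My current approach is to use a compiled regular expression: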
apo = re.compile("[A-Za-z]'[A-Za-z]")
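and do the following:
if "'" in word and not apo.search(word): word = word.replace("'","\n<singlequote>")
But this misses cases where single quotes surround a word that itself contains an apostrophe. It also does not tell me whether the single quote is adjacent to the start or the end of a word.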
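To make the limitation concrete, here is a small sketch that runs the check above against a few of the sample tokens. The helper name `mark_quotes` is hypothetical and only wraps the two lines from the question.

```python
import re

apo = re.compile("[A-Za-z]'[A-Za-z]")

def mark_quotes(word):
    # Hypothetical wrapper around the check from the question.
    # Any word containing an inner apostrophe is skipped entirely,
    # so a quote attached to that same word is never tagged.
    if "'" in word and not apo.search(word):
        word = word.replace("'", "\n<singlequote>")
    return word

for w in ["don't", "'George", "end.'", "'Won't", "didn't.'"]:
    print(repr(w), "->", repr(mark_quotes(w)))
```

The tokens `'Won't` and `didn't.'` come back unchanged, and the tag that is emitted for `'George` and `end.'` is the same in both cases, so opening and closing quotes cannot be told apart.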
Sample input:
don't 'George ma'am end.' didn't.' 'Won't
Sample output (after processing and printing to a file):
don't <opensingle> George ma'am end <period> <closesingle> didn't <period> <closesingle> <opensingle> Won't
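I do have a further question about this task: since distinguishing <opensingle> from <closesingle> seems rather difficult, would it be wiser to perform substitutions such as
word = word.replace('.','\n<period>')
word = word.replace(',','\n<comma>')
after performing the quote replacement?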
Posted on 2018-06-10 04:58:36
I think this could benefit from lookahead and lookbehind assertions. The Python reference is https://docs.python.org/3/library/re.html , and a general regex site I often refer to is https://www.regular-expressions.info/lookaround.html .
Your data:
words = ["don't", "'George", "ma'am", "end.'", "didn't.'", "'Won't",]
Now I'll define a tuple of regexes paired with their replacements.
In [230]: import re
     ...: from functools import reduce  # reduce is not a builtin in Python 3
     ...:
     ...: apo = (
     ...:     (re.compile("(?<=[A-Za-z])'(?=[A-Za-z])"), "<apostrophe>"),
     ...:     (re.compile("(?<![A-Za-z])'(?=[A-Za-z])"), "<opensingle>"),
     ...:     (re.compile("(?<=[.A-Za-z])'(?![A-Za-z])"), "<closesingle>"),
     ...:     (re.compile("(?<=[A-Za-z])\\.(?![A-Za-z])"), "<period>"),
     ...: )

In [231]: words = ["don't", "'George", "ma'am", "end.'", "didn't.'", "'Won't"]

In [232]: reduce(lambda w2, x: [x[0].sub(x[1], w) for w in w2], apo, words)
Out[232]:
['don<apostrophe>t',
 '<opensingle>George',
 'ma<apostrophe>am',
 'end<period><closesingle>',
 'didn<apostrophe>t<period><closesingle>',
 '<opensingle>Won<apostrophe>t']
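Here is what is going on in the regexes:
(?<=[A-Za-z]) is a lookbehind: match only if the preceding character is a letter, without consuming it.
(?=[A-Za-z]) is a lookahead: match only if the following character is a letter, without consuming it.
(?<![A-Za-z]) is a negative lookbehind: match only if the preceding character is not a letter.
(?![A-Za-z]) is a negative lookahead: match only if the following character is not a letter.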
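As a quick illustration of what "zero-width" means for these four assertions, the following sketch inspects a match directly. The variable names `inner` and `opening` are chosen here for illustration and are not part of the answer's code.

```python
import re

# Apostrophe with a letter on both sides (lookbehind + lookahead).
inner = re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])")
# Quote with no letter before it and a letter after it.
opening = re.compile(r"(?<![A-Za-z])'(?=[A-Za-z])")

m = inner.search("don't")
# Only the apostrophe is part of the match; the neighbouring
# letters are checked but not consumed.
print(m.group(), m.span())

# A leading quote has no letter before it, so the lookbehind fails.
print(inner.search("'George"))

# The negative lookbehind accepts it, and sub() replaces the quote only.
print(opening.sub("<opensingle>", "'George"))
```

Because the surrounding letters are never consumed, `sub()` replaces just the quote character and leaves the word itself untouched.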
Note that I added a . check to the <closesingle> pattern, and that the order within apo matters, because you might replace . with <period> ...
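The effect of the ordering can be checked with a small sketch. It reuses three of the patterns from the tuple, while the helper `apply_rules` and the variable names are assumptions made for this illustration.

```python
import re
from functools import reduce

apostrophe = (re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])"), "<apostrophe>")
close_q = (re.compile(r"(?<=[.A-Za-z])'(?![A-Za-z])"), "<closesingle>")
period = (re.compile(r"(?<=[A-Za-z])\.(?![A-Za-z])"), "<period>")

def apply_rules(rules, text):
    # Apply each (pattern, replacement) pair in the order given.
    return reduce(lambda s, rule: rule[0].sub(rule[1], s), rules, text)

quotes_first = apply_rules((apostrophe, close_q, period), "end.'")
period_first = apply_rules((apostrophe, period, close_q), "end.'")

print(quotes_first)
print(period_first)
```

When the period is replaced first, the quote ends up preceded by `>` rather than by `.` or a letter, so the `<closesingle>` lookbehind no longer matches and the quote is left as a bare `'`.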
This operates on individual words, but it should work on whole sentences as well.
In [233]: onelong = """
     ...: don't 'George ma'am end.' didn't.' 'Won't
     ...: """

In [235]: print(
     ...:     reduce(lambda sentence, x: x[0].sub(x[1], sentence), apo, onelong)
     ...: )
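To get closer to the one-token-per-line format asked for in the question, one possible variation (a sketch, not the answer's exact code) is to leave inner apostrophes alone, pad each tag with spaces, and split afterwards. The names `RULES` and `tokenize` are hypothetical.

```python
import re

# Applied in order: quotes are tagged before periods, because the
# closing-quote pattern relies on the "." still being present.
RULES = (
    (re.compile(r"(?<![A-Za-z])'(?=[A-Za-z])"), " <opensingle> "),
    (re.compile(r"(?<=[.A-Za-z])'(?![A-Za-z])"), " <closesingle> "),
    (re.compile(r"(?<=[A-Za-z])\.(?![A-Za-z])"), " <period> "),
)

def tokenize(line):
    # Surround each tag with spaces, then split on whitespace so that
    # words and tags become separate tokens.
    for pattern, tag in RULES:
        line = pattern.sub(tag, line)
    return line.split()

tokens = tokenize("don't 'George ma'am end.' didn't.' 'Won't")
print("\n".join(tokens))
```

One caveat of this sketch: a word that ends in an apostrophe, such as a plural possessive, would be tagged as a closing quote, since the patterns look only at the adjacent characters.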