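When converting Chinese characters to pinyin, HanLP can disambiguate polyphonic characters and output the tone, initial (声母) and final (韵母) of each syllable, using a dictionary built from a corpus.
The code in this article comes from HanLP-1.7.5, the version that accompanies the book 《自然语言处理入门》 (Introduction to Natural Language Processing).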
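In HanLP, both Chinese-to-pinyin conversion and Simplified/Traditional Chinese conversion rely on two structures: the double-array trie (Double-array Trie) and the Aho-Corasick DoubleArrayTrie algorithm (ACDAT), an AC automaton built on top of a double-array trie. It helps to be familiar with both before reading on.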
Converting 重载不是重任 to pinyin produces the following result:
Original: 重载不是重任
Pinyin (tone numbers): chong2,zai3,bu2,shi4,zhong4,ren4,
Pinyin (tone marks): chóng,zǎi,bú,shì,zhòng,rèn,
Pinyin (no tone): chong,zai,bu,shi,zhong,ren,
Tone: 2,3,2,4,4,4,
Initial: ch,z,b,sh,zh,r,
Final: ong,ai,u,i,ong,en,
Input-method head: ch,z,b,sh,zh,r,
pinyin.txt
一丁点儿=yi1,ding1,dian3,er5
一不小心=yi1,bu4,xiao3,xin1
一丘之貉=yi1,qiu1,zhi1,he2
一丝不差=yi4,si1,bu4,cha1
一丝不苟=yi1,si1,bu4,gou3
一个=yi1,ge4
一个半个=yi1,ge4,ban4,ge4
一个巴掌拍不响=yi1,ge4,ba1,zhang3,pai1,bu4,xiang3
一个萝卜一个坑=yi1,ge4,luo2,bo5,yi1,ge4,keng1
一举两得=yi1,ju3,liang3,de2
一之为甚=yi1,zhi1,wei2,shen4
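Training: generating pinyin.txt.bin
Loading the corpus
HanLP-1.7.5\src\main\java\com\hankcs\hanlp\corpus\dictionary\SimpleDictionary.java
The corpus is read line by line. Each line is split on = and the resulting entry is put into the dictionary trie.
For each syllable on the right-hand side of =, Pinyin.valueOf("yi1") returns the matching enum constant, which carries the initial, the final, the tone, the string form with a tone mark, and the string form without a tone.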
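The parsing step described above can be sketched as follows. This is only an illustration, not HanLP's code: `DictLineSketch`, `MiniPinyin`, `parseLine` and `parseValue` are hypothetical names, and `MiniPinyin` is a two-constant stand-in for the real Pinyin enum.

```java
import java.util.AbstractMap;
import java.util.Map;

final class DictLineSketch {
    /** Hypothetical two-constant stand-in for HanLP's much larger Pinyin enum. */
    enum MiniPinyin {
        yi1(1, "yī", "yi"),
        ge4(4, "gè", "ge");

        final int tone;
        final String withToneMark;
        final String withoutTone;

        MiniPinyin(int tone, String withToneMark, String withoutTone) {
            this.tone = tone;
            this.withToneMark = withToneMark;
            this.withoutTone = withoutTone;
        }
    }

    /** Turns "yi1,ge4" into enum constants via valueOf, one per syllable. */
    static MiniPinyin[] parseValue(String value) {
        String[] parts = value.split(",");
        MiniPinyin[] result = new MiniPinyin[parts.length];
        for (int i = 0; i < parts.length; i++) {
            result[i] = MiniPinyin.valueOf(parts[i]);
        }
        return result;
    }

    /** Splits "word=py1,py2" on the first '=' into a (word, syllables) entry. */
    static Map.Entry<String, MiniPinyin[]> parseLine(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("not a key=value line: " + line);
        }
        return new AbstractMap.SimpleEntry<>(line.substring(0, eq), parseValue(line.substring(eq + 1)));
    }

    public static void main(String[] args) {
        Map.Entry<String, MiniPinyin[]> entry = parseLine("一个=yi1,ge4");
        for (MiniPinyin p : entry.getValue()) {
            System.out.println(entry.getKey() + ": " + p + " " + p.withToneMark + " tone " + p.tone);
        }
    }
}
```

Because `valueOf` throws `IllegalArgumentException` for an unknown name, a malformed syllable in the dictionary fails fast in this sketch.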
public enum Pinyin
{
    a1(Shengmu.none, Yunmu.a, 1, "ā", "a", Head.a, 'a'),
    a2(Shengmu.none, Yunmu.a, 2, "á", "a", Head.a, 'a'),
    a3(Shengmu.none, Yunmu.a, 3, "ǎ", "a", Head.a, 'a'),
    a4(Shengmu.none, Yunmu.a, 4, "à", "a", Head.a, 'a'),
    a5(Shengmu.none, Yunmu.a, 5, "a", "a", Head.a, 'a'),
    ai1(Shengmu.none, Yunmu.ai, 1, "āi", "ai", Head.a, 'a'),
    ai2(Shengmu.none, Yunmu.ai, 2, "ái", "ai", Head.a, 'a'),
    ai3(Shengmu.none, Yunmu.ai, 3, "ǎi", "ai", Head.a, 'a'),
    ai4(Shengmu.none, Yunmu.ai, 4, "ài", "ai", Head.a, 'a'),
    ......
}
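The Map is then built into a double-array trie with `trie.build(map)`. For background, see the companion article on how the double-array trie (Double-array Trie) is implemented, with code and diagrams.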
public void build(TreeMap<String, V> map) {
    // keep the values
    v = (V[]) map.values().toArray();
    l = new int[v.length];
    Set<String> keySet = map.keySet();
    // build the binary-search trie
    addAllKeyword(keySet);
    // build the double-array trie on top of the binary-search trie
    buildDoubleArrayTrie(keySet);
    used = null;
    // build the failure table and merge the output table
    constructFailureStates();
    rootState = null;
    loseWeight();
}
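To make the "failure table" and "output table" steps more concrete, here is a minimal sketch of an Aho-Corasick automaton over a plain HashMap-based trie. It is not HanLP's implementation: there is no double-array compression, and `AcSketch` and the string-based hit format of its `parseText` are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

final class AcSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> output = new ArrayList<>();
        Node fail;
    }

    private final Node root = new Node();

    AcSketch(List<String> keywords) {
        // 1. Insert every keyword into an ordinary trie.
        for (String keyword : keywords) {
            Node node = root;
            for (char c : keyword.toCharArray()) {
                Node child = node.next.get(c);
                if (child == null) {
                    child = new Node();
                    node.next.put(c, child);
                }
                node = child;
            }
            node.output.add(keyword);
        }
        // 2. Breadth-first pass: set failure links and merge outputs.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.poll();
            for (Map.Entry<Character, Node> e : current.next.entrySet()) {
                Node child = e.getValue();
                Node f = current.fail;
                while (f != null && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                child.fail = (f == null) ? root : f.next.get(e.getKey());
                // Keywords ending at the failure state also end here.
                child.output.addAll(child.fail.output);
                queue.add(child);
            }
        }
    }

    /** Scans the text once; each hit is "begin:end:keyword" with end exclusive. */
    List<String> parseText(String text) {
        List<String> hits = new ArrayList<>();
        Node state = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (state != root && !state.next.containsKey(c)) {
                state = state.fail;
            }
            Node target = state.next.get(c);
            if (target != null) {
                state = target;
            }
            for (String keyword : state.output) {
                hits.add((i + 1 - keyword.length()) + ":" + (i + 1) + ":" + keyword);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> keywords = java.util.Arrays.asList("重", "重载", "重任", "不", "是");
        System.out.println(new AcSketch(keywords).parseText("重载不是重任"));
    }
}
```

The text is scanned in a single pass, and overlapping keywords such as 重 and 重载 are both reported, which is what the longest-match step later relies on.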
The model file is generated by saveDat(path, trie, map.entrySet());
static boolean saveDat(String path, AhoCorasickDoubleArrayTrie<Pinyin[]> trie, Set<Map.Entry<String, Pinyin[]>> entrySet) {
    try {
        DataOutputStream out = new DataOutputStream(new BufferedOutputStream(IOUtil.newOutputStream(path + Predefine.BIN_EXT)));
        // number of entries, then each entry as (length, ordinals...)
        out.writeInt(entrySet.size());
        for (Map.Entry<String, Pinyin[]> entry : entrySet) {
            Pinyin[] value = entry.getValue();
            out.writeInt(value.length);
            for (Pinyin pinyin : value) {
                out.writeInt(pinyin.ordinal());
            }
        }
        trie.save(out);
        out.close();
    } catch (Exception e) {
        // message reads: "failed to cache value dat <path>"
        logger.warning("缓存值dat" + path + "失败");
        return false;
    }
    return true;
}
trie.save(out) then writes the automaton itself:
/**
 * Persist the automaton
 *
 * @param out a DataOutputStream
 * @throws Exception possible IO exceptions, etc.
 */
public void save(DataOutputStream out) throws Exception {
    out.writeInt(size);
    for (int i = 0; i < size; i++) {
        out.writeInt(base[i]);
        out.writeInt(check[i]);
        out.writeInt(fail[i]);
        int output[] = this.output[i];
        if (output == null) {
            out.writeInt(0);
        } else {
            out.writeInt(output.length);
            for (int o : output) {
                out.writeInt(o);
            }
        }
    }
    out.writeInt(l.length);
    for (int length : l) {
        out.writeInt(length);
    }
}
Loading pinyin.txt.bin reverses these steps:
// path = data/dictionary/pinyin/pinyin.txt
static boolean loadDat(String path) {
    ByteArray byteArray = ByteArray.createByteArray(path + Predefine.BIN_EXT);
    if (byteArray == null) return false;
    int size = byteArray.nextInt();
    Pinyin[][] valueArray = new Pinyin[size][];
    for (int i = 0; i < valueArray.length; ++i) {
        int length = byteArray.nextInt();
        valueArray[i] = new Pinyin[length];
        for (int j = 0; j < length; ++j) {
            // each stored int is an ordinal, used here as an index into pinyins
            valueArray[i][j] = pinyins[byteArray.nextInt()];
        }
    }
    if (!trie.load(byteArray, valueArray)) return false;
    return true;
}
public boolean load(ByteArray byteArray, V[] value) {
    if (byteArray == null) return false;
    size = byteArray.nextInt();
    base = new int[size + 65535]; // leave some slack to avoid out-of-bounds access
    check = new int[size + 65535];
    fail = new int[size + 65535];
    output = new int[size + 65535][];
    int length;
    for (int i = 0; i < size; ++i) {
        base[i] = byteArray.nextInt();
        check[i] = byteArray.nextInt();
        fail[i] = byteArray.nextInt();
        length = byteArray.nextInt();
        if (length == 0) continue;
        output[i] = new int[length];
        for (int j = 0; j < output[i].length; ++j) {
            output[i][j] = byteArray.nextInt();
        }
    }
    length = byteArray.nextInt();
    l = new int[length];
    for (int i = 0; i < l.length; ++i) {
        l[i] = byteArray.nextInt();
    }
    v = value;
    return true;
}
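The value part of the file format (entry count, then for each entry a length followed by enum ordinals) can be tried out in isolation. The sketch below round-trips such data through an in-memory byte array. `OrdinalCodecSketch` and its `Syllable` enum are hypothetical stand-ins, and it uses the JDK's `DataInputStream` rather than HanLP's `ByteArray`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class OrdinalCodecSketch {
    /** Hypothetical stand-in for the Pinyin enum. */
    enum Syllable { yi1, ge4, ban4 }

    /** Writes: count, then for each entry its length and the ordinals. */
    static byte[] write(Syllable[][] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(values.length);
            for (Syllable[] value : values) {
                out.writeInt(value.length);
                for (Syllable s : value) {
                    out.writeInt(s.ordinal());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the same layout back, mapping each ordinal to its constant. */
    static Syllable[][] read(byte[] data) {
        Syllable[] all = Syllable.values();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            Syllable[][] values = new Syllable[in.readInt()][];
            for (int i = 0; i < values.length; i++) {
                values[i] = new Syllable[in.readInt()];
                for (int j = 0; j < values[i].length; j++) {
                    values[i][j] = all[in.readInt()];
                }
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Syllable[][] values = {
            {Syllable.yi1, Syllable.ge4},
            {Syllable.yi1, Syllable.ge4, Syllable.ban4, Syllable.ge4}
        };
        byte[] data = write(values);
        System.out.println(data.length + " bytes, round trip ok: " + Arrays.deepEquals(values, read(data)));
    }
}
```

One consequence of storing ordinals is that the file is only valid for the enum it was written with: reordering or inserting constants changes the meaning of every stored integer, so the binary cache would need to be regenerated.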
The pinyin of each character is then looked up with ACDAT, the Aho-Corasick DoubleArrayTrie algorithm (an AC automaton based on a double-array trie):
// HanLP-1.7.5\src\main\java\com\hankcs\hanlp\dictionary\py\PinyinDictionary.java
protected static List<Pinyin> segLongest(char[] charArray, AhoCorasickDoubleArrayTrie<Pinyin[]> trie, boolean remainNone) {
    final Pinyin[][] wordNet = new Pinyin[charArray.length][];
    trie.parseText(charArray, new AhoCorasickDoubleArrayTrie.IHit<Pinyin[]>() {
        @Override
        public void hit(int begin, int end, Pinyin[] value) {
            int length = end - begin;
            // keep only the longest hit starting at each position
            if (wordNet[begin] == null || length > wordNet[begin].length) {
                wordNet[begin] = length == 1 ? new Pinyin[]{value[0]} : value;
            }
        }
    });
    List<Pinyin> pinyinList = new ArrayList<Pinyin>(charArray.length);
    for (int offset = 0; offset < wordNet.length; ) {
        if (wordNet[offset] == null) {
            // no dictionary entry starts here
            if (remainNone) {
                pinyinList.add(Pinyin.none5);
            }
            ++offset;
            continue;
        }
        for (Pinyin pinyin : wordNet[offset]) {
            pinyinList.add(pinyin);
        }
        offset += wordNet[offset].length;
    }
    return pinyinList;
}
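The longest-match logic can be exercised without HanLP. The sketch below keeps the same wordNet bookkeeping but replaces the ACDAT scan with a naive lookup of every substring in a HashMap, and uses plain strings instead of Pinyin constants. `LongestMatchSketch` and the six dictionary entries are assumptions chosen to mirror the example sentence, not data copied from pinyin.txt.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LongestMatchSketch {
    static List<String> segLongest(String text, Map<String, String[]> dict, boolean remainNone) {
        String[][] wordNet = new String[text.length()][];
        // Naive replacement for trie.parseText: try every substring as a key.
        for (int begin = 0; begin < text.length(); begin++) {
            for (int end = begin + 1; end <= text.length(); end++) {
                String[] value = dict.get(text.substring(begin, end));
                if (value == null) {
                    continue;
                }
                int length = end - begin;
                // Keep only the longest hit starting at this position.
                if (wordNet[begin] == null || length > wordNet[begin].length) {
                    wordNet[begin] = length == 1 ? new String[]{value[0]} : value;
                }
            }
        }
        List<String> result = new ArrayList<>(text.length());
        for (int offset = 0; offset < wordNet.length; ) {
            if (wordNet[offset] == null) {
                if (remainNone) {
                    result.add("none5");
                }
                ++offset;
                continue;
            }
            for (String pinyin : wordNet[offset]) {
                result.add(pinyin);
            }
            // Jump past the whole matched word.
            offset += wordNet[offset].length;
        }
        return result;
    }

    /** Illustrative entries only; a single character may list several readings. */
    static Map<String, String[]> sampleDict() {
        Map<String, String[]> dict = new HashMap<>();
        dict.put("重", new String[]{"zhong4", "chong2"});
        dict.put("重载", new String[]{"chong2", "zai3"});
        dict.put("重任", new String[]{"zhong4", "ren4"});
        dict.put("不", new String[]{"bu4"});
        dict.put("是", new String[]{"shi4"});
        dict.put("不是", new String[]{"bu2", "shi4"});
        return dict;
    }

    public static void main(String[] args) {
        System.out.println(segLongest("重载不是重任", sampleDict(), true));
    }
}
```

With these entries, 重 is read chong2 inside 重载 and zhong4 inside 重任 because the two-character words win over the single-character entry. That is how the dictionary approach handles polyphonic characters: the disambiguation lives in the multi-character entries.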
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.dictionary.py.Pinyin;

import java.util.List;

public class PinyinDemo {
    public static void main(String[] args) {
        String text = "重载不是重任";
        List<Pinyin> pinyinList = HanLP.convertToPinyinList(text);
        System.out.print("Original: ");
        for (char c : text.toCharArray())
            System.out.printf("%c", c);
        System.out.println();
        System.out.print("Pinyin (tone numbers): ");
        for (Pinyin pinyin : pinyinList)
            System.out.printf("%s,", pinyin);
        System.out.println();
        System.out.print("Pinyin (tone marks): ");
        for (Pinyin pinyin : pinyinList)
            System.out.printf("%s,", pinyin.getPinyinWithToneMark());
        System.out.println();
        System.out.print("Pinyin (no tone): ");
        for (Pinyin pinyin : pinyinList)
            System.out.printf("%s,", pinyin.getPinyinWithoutTone());
        System.out.println();
        System.out.print("Tone: ");
        for (Pinyin pinyin : pinyinList)
            System.out.printf("%s,", pinyin.getTone());
        System.out.println();
        System.out.print("Initial: ");
        for (Pinyin pinyin : pinyinList)
            System.out.printf("%s,", pinyin.getShengmu());
        System.out.println();
        System.out.print("Final: ");
        for (Pinyin pinyin : pinyinList)
            System.out.printf("%s,", pinyin.getYunmu());
        System.out.println();
        System.out.print("Input-method head: ");
        for (Pinyin pinyin : pinyinList)
            System.out.printf("%s,", pinyin.getHead());
        System.out.println();
    }
}
Output:
Original: 重载不是重任
Pinyin (tone numbers): chong2,zai3,bu2,shi4,zhong4,ren4,
Pinyin (tone marks): chóng,zǎi,bú,shì,zhòng,rèn,
Pinyin (no tone): chong,zai,bu,shi,zhong,ren,
Tone: 2,3,2,4,4,4,
Initial: ch,z,b,sh,zh,r,
Final: ong,ai,u,i,ong,en,
Input-method head: ch,z,b,sh,zh,r,
Data download: http://download.hanlp.com/data-for-1.7.5.zip
This article is from cnblogs (博客园), author: VipSoft. When reposting, please cite the original link: https://www.cnblogs.com/vipsoft/p/17972448