Java实现DFA算法对敏感词、广告词过滤功能示例

一、前言

开发中经常要处理用户一些文字的提交，所以涉及到了敏感词过滤的功能，参考资料中dfa有穷状态机算法的实现，创建有向图。完成了对敏感词、广告词的过滤，而且效率较好，所以分享一下。

具体实现：

1、匹配大小写过滤
2、匹配全角半角过滤
3、匹配过滤停顿词过滤。
4、敏感词重复词过滤。

例如：

支持如下类型类型过滤检测：

fuck 全小写

fuck 大小写

ｆｕｃｋ全角半角

f!!!u&c ###k 停顿词

fffuuuucccckkk 重复词

敏感词过滤的做法有很多，我简单描述我现在理解的几种：

①查询数据库当中的敏感词，循环每一个敏感词，然后去输入的文本中从头到尾搜索一遍，看是否存在此敏感词，有则做相

应的处理，这种方式讲白了就是找到一个处理一个。

优点：so easy。用java代码实现基本没什么难度。

缺点：这效率让我心中奔过十万匹草泥马，而且匹配的是不是有些蛋疼，如果是英文时你会发现一个很无语的事情，比如英文

a是敏感词，那我如果是一篇英文文档，那程序它妹的得处理多少次敏感词？谁能告诉我？

②传说中的dfa算法（有穷自动机）,也正是我要给大家分享的，毕竟感觉比较通用，算法的原理希望大家能够自己去网上查查

资料，这里就不详细说明了。

优点：至少比上面那sb效率高点。

缺点：对于学过算法的应该不难，对于没学过算法的用起来也不难，就是理解起来有点gg疼，匹配效率也不高，比较耗费内存，

敏感词越多，内存占用的就越大。

③第三种在这里要特别说明一下，那就是你自己去写一个算法吧，或者在现有的算法的基础上去优化，这也是小alan追求的至

高境界之一，如果哪位淫兄有自己的想法一定别忘了小alan，可以加小alan的qq：810104041教小alan两招耍耍。

二、代码实现

其目录结构如下：

其中resources资源目录中：

stopwd.txt ：停顿词，匹配时间直接过滤。

wd.txt：敏感词库。

1、wordfilter敏感词过滤类

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

package org.andy.sensitivewdfilter;

import java.io.bufferedreader;

import java.io.ioexception;

import java.io.inputstreamreader;

import java.util.arraylist;

import java.util.hashmap;

import java.util.hashset;

import java.util.list;

import java.util.map;

import java.util.set;

import org.andy.sensitivewdfilter.util.bcconvert;

/**

* 创建时间：2016年8月30日下午3:01:12

*

* 思路：创建一个filterset，枚举了0~65535的所有char是否是某个敏感词开头的状态

*

* 判断是否是敏感词开头 | | 是不是获取头节点 ok--下一个字然后逐级遍历，dfa算法

*

* @author andy

* @version 2.2

*/

public class wordfilter {

private static final filterset set = new filterset(); // 存储首字

private static final map<integer, wordnode> nodes = new hashmap<integer, wordnode>(1024, 1); // 存储节点

private static final set<integer> stopwdset = new hashset<>(); // 停顿词

private static final char sign = '*'; // 敏感词过滤替换

static {

try {

long a = system.nanotime();

init();

a = system.nanotime() - a;

system.out.println("加载时间 : " + a + "ns");

system.out.println("加载时间 : " + a / 1000000 + "ms");

} catch (exception e) {

throw new runtimeexception("初始化过滤器失败");

}

private static void init() {

// 获取敏感词

addsensitiveword(readwordfromfile("wd.txt"));

addstopword(readwordfromfile("stopwd.txt"));

}

/**

* 增加敏感词

* @param path

* @return

*/

private static list<string> readwordfromfile(string path) {

list<string> words;

bufferedreader br = null;

try {

br = new bufferedreader(new inputstreamreader(wordfilter.class.getclassloader().getresourceasstream(path)));

words = new arraylist<string>(1200);

for (string buf = ""; (buf = br.readline()) != null;) {

if (buf == null || buf.trim().equals(""))

continue;

words.add(buf);

}

} catch (exception e) {

throw new runtimeexception(e);

} finally {

try {

if (br != null)

br.close();

} catch (ioexception e) {

}

return words;

}

/**

* 增加停顿词

*

* @param words

*/

private static void addstopword(final list<string> words) {

if (words != null && words.size() > 0) {

char[] chs;

for (string curr : words) {

chs = curr.tochararray();

for (char c : chs) {

stopwdset.add(charconvert(c));

}

/**

* 添加dfa节点

* @param words

*/

private static void addsensitiveword(final list<string> words) {

if (words != null && words.size() > 0) {

char[] chs;

int fchar;

int lastindex;

wordnode fnode; // 首字母节点

for (string curr : words) {

chs = curr.tochararray();

fchar = charconvert(chs[0]);

if (!set.contains(fchar)) {// 没有首字定义

set.add(fchar);// 首字标志位可重复add,反正判断了，不重复了

fnode = new wordnode(fchar, chs.length == 1);

nodes.put(fchar, fnode);

} else {

fnode = nodes.get(fchar);

if (!fnode.islast() && chs.length == 1)

fnode.setlast(true);

}

lastindex = chs.length - 1;

for (int i = 1; i < chs.length; i++) {

fnode = fnode.addifnoexist(charconvert(chs[i]), i == lastindex);

}

/**

* 过滤判断将敏感词转化为成屏蔽词

* @param src

* @return

*/

public static final string dofilter(final string src) {

char[] chs = src.tochararray();

int length = chs.length;

int currc;

int k;

wordnode node;

for (int i = 0; i < length; i++) {

currc = charconvert(chs[i]);

if (!set.contains(currc)) {

continue;

}

node = nodes.get(currc);// 日 2

if (node == null)// 其实不会发生，习惯性写上了

continue;

boolean couldmark = false;

int marknum = -1;

if (node.islast()) {// 单字匹配（日）

couldmark = true;

marknum = 0;

}

// 继续匹配（日你/日你妹），以长的优先

// 你-3 妹-4 夫-5

k = i;

for (; ++k < length;) {

int temp = charconvert(chs[k]);

if (stopwdset.contains(temp))

continue;

node = node.querysub(temp);

if (node == null)// 没有了

break;

if (node.islast()) {

couldmark = true;

marknum = k - i;// 3-2

}

if (couldmark) {

for (k = 0; k <= marknum; k++) {

chs[k + i] = sign;

}

i = i + marknum;

}

return new string(chs);

}

/**

* 是否包含敏感词

* @param src

* @return

*/

public static final boolean iscontains(final string src) {

char[] chs = src.tochararray();

int length = chs.length;

int currc;

int k;

wordnode node;

for (int i = 0; i < length; i++) {

currc = charconvert(chs[i]);

if (!set.contains(currc)) {

continue;

}

node = nodes.get(currc);// 日 2

if (node == null)// 其实不会发生，习惯性写上了

continue;

boolean couldmark = false;

if (node.islast()) {// 单字匹配（日）

couldmark = true;

}

// 继续匹配（日你/日你妹），以长的优先

// 你-3 妹-4 夫-5

k = i;

for (; ++k < length;) {

int temp = charconvert(chs[k]);

if (stopwdset.contains(temp))

continue;

node = node.querysub(temp);

if (node == null)// 没有了

break;

if (node.islast()) {

couldmark = true;

}

if (couldmark) {

return true;

}

return false;

}

/**

* 大写转化为小写全角转化为半角

*

* @param src

* @return

*/

private static int charconvert(char src) {

int r = bcconvert.qj2bj(src);

return (r >= 'a' && r <= 'z') ? r + 32 : r;

}

其中：

iscontains ：是否包含敏感词
dofilter：过滤敏感词

2、wordnode敏感词节点

package org.andy.sensitivewdfilter;

import java.util.linkedlist;

import java.util.list;

/**

* 创建时间：2016年8月30日下午3:07:45

*

* @author andy

* @version 2.2

*/

public class wordnode {

private int value; // 节点名称

private list<wordnode> subnodes; // 子节点

private boolean islast;// 默认false

public wordnode(int value) {

this.value = value;

}

public wordnode(int value, boolean islast) {

this.value = value;

this.islast = islast;

}

/**

*

* @param subnode

* @return 就是传入的subnode

*/

private wordnode addsubnode(final wordnode subnode) {

if (subnodes == null)

subnodes = new linkedlist<wordnode>();

subnodes.add(subnode);

return subnode;

}

/**

* 有就直接返回该子节点，没有就创建添加并返回该子节点

*

* @param value

* @return

*/

public wordnode addifnoexist(final int value, final boolean islast) {

if (subnodes == null) {

return addsubnode(new wordnode(value, islast));

}

for (wordnode subnode : subnodes) {

if (subnode.value == value) {

if (!subnode.islast && islast)

subnode.islast = true;

return subnode;

}

return addsubnode(new wordnode(value, islast));

}

public wordnode querysub(final int value) {

if (subnodes == null) {

return null;

}

for (wordnode subnode : subnodes) {

if (subnode.value == value)

return subnode;

}

return null;

}

public boolean islast() {

return islast;

}

public void setlast(boolean islast) {

this.islast = islast;

}

@override

public int hashcode() {

return value;

}

三、测试结果

项目包含敏感词库，源码，停顿词库等，只需运行maven打jar包直接可运行。

项目源码：sensitivewd-filter.rar

以上就是本文的全部内容，希望对大家的学习有所帮助，也希望大家多多支持快网idc。

原文链接：http://blog.csdn.net/fengshizty/article/details/52373005

Java实现DFA算法对敏感词、广告词过滤功能示例

相关文章

快网idc优惠网

相关文章

微信

快网idc优惠网

QQ交流群