Distributed word segmentation with HanLP on a Spark cluster

It takes two steps:

Step 1: implement com.hankcs.hanlp.corpus.io.IIOAdapter

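import com.hankcs.hanlp.corpus.io.IIOAdapter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

// Lets HanLP read and write its data files through the Hadoop FileSystem API.
public class HadoopFileIoAdapter implements IIOAdapter {

    // Opens an existing file (for example on HDFS) for reading.
    @Override
    public InputStream open(String path) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(path), conf);
        return fs.open(new Path(path));
    }

    // Creates a file for writing, overwriting it if it already exists.
    @Override
    public OutputStream create(String path) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(path), conf);
        return fs.create(new Path(path));
    }
}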

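Step 2: edit the configuration file. Set root to the location of the HanLP data package on HDFS, and change IOAdapter to the class we implemented above.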
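
As a rough illustration, the two relevant entries in hanlp.properties might look like the fragment below. The NameNode address, the directory, and the package name com.example.hanlp are placeholders I made up, not values from the original setup; adjust them to your own cluster and to wherever you put the adapter class.

```properties
# Placeholder HDFS location of the HanLP data package (the directory that contains "data/")
root=hdfs://namenode:8020/user/hanlp/
# Fully qualified name of the adapter from step 1 (the package name is hypothetical)
IOAdapter=com.example.hanlp.HadoopFileIoAdapter
```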


OK, with that in place you can use HanLP for word segmentation on a distributed cluster.
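
If it still fails on the executors, one thing worth checking: the configuration only carries a class name, so the adapter is presumably created by reflection, which means it needs a public no-argument constructor and has to be on the classpath of every executor. The sketch below imitates that loading step with the JDK only. StreamAdapter, InMemoryAdapter and AdapterLoadingSketch are hypothetical stand-ins for illustration, not HanLP classes, and this is not HanLP's actual loading code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical stand-in for HanLP's IIOAdapter, reduced to one method.
interface StreamAdapter {
    InputStream open(String path) throws IOException;
}

// Hypothetical adapter; a real one would open the path on HDFS as in step 1.
class InMemoryAdapter implements StreamAdapter {
    // Public no-argument constructor, needed for reflective creation.
    public InMemoryAdapter() {}

    // Returns the path's own bytes, just so there is something to read.
    @Override
    public InputStream open(String path) {
        return new ByteArrayInputStream(path.getBytes(StandardCharsets.UTF_8));
    }
}

class AdapterLoadingSketch {
    // Reads the "IOAdapter" key from properties text and instantiates the named class.
    static StreamAdapter load(String propertiesText) throws Exception {
        Properties p = new Properties();
        p.load(new StringReader(propertiesText));
        String className = p.getProperty("IOAdapter");
        if (className == null) {
            throw new IllegalArgumentException("IOAdapter is not set");
        }
        // Throws ClassNotFoundException if the class is missing from the classpath.
        Class<?> clazz = Class.forName(className);
        return (StreamAdapter) clazz.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        StreamAdapter adapter = load("IOAdapter=" + InMemoryAdapter.class.getName());
        System.out.println("Loaded adapter: " + adapter.getClass().getName());
    }
}
```

In a Spark job this usually means shipping the adapter class inside the application jar (or passing it with --jars) so that the executors can find it by name.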

Hope this helps anyone who needs it.


Reposted from blog.csdn.net/qq_33872191/article/details/84963310