To read data from MongoDB with multiple threads, I first tried a thread pool built with Executors.newFixedThreadPool. In practice only the first thread read data normally; every thread created after it hit errors when it touched the Mongo cursor.
I then changed the approach and started the threads directly with Thread.start(). The code looks like this:
RsUserTagsRunner job = new RsUserTagsRunner(start, end, subOver);
for (int i = 0; i < ConstantTool.MONGO_THREADS; i++) {
    new Thread(job).start();
}
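The caller presumably waits on the subOver latch for the workers to finish. The sharing pattern, one Runnable instance handed to every Thread plus a CountDownLatch, can be sketched in isolation (CountingJob and runDemo are illustrative names, not from the project):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedJobDemo {
    // One Runnable instance shared by all threads, mirroring RsUserTagsRunner.
    static class CountingJob implements Runnable {
        final AtomicInteger runs = new AtomicInteger();
        final CountDownLatch over;
        CountingJob(CountDownLatch over) { this.over = over; }
        public void run() {
            runs.incrementAndGet();   // shared state: every thread touches the same field
            over.countDown();         // signal completion, as subOver does in the article
        }
    }

    static int runDemo(int threads) {
        CountDownLatch over = new CountDownLatch(threads);
        CountingJob job = new CountingJob(over);
        for (int i = 0; i < threads; i++) {
            new Thread(job).start();  // the same instance is handed to every Thread
        }
        try {
            over.await();             // the caller blocks until every worker has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return job.runs.get();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4));  // prints 4
    }
}
```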
Because the threads are launched with Thread.start() and they all share a single RsUserTagsRunner instance, the logic inside run() has to be written with that sharing in mind, which makes it different from a plain single-threaded Runnable.run(). The difference in short: calling run() directly just executes the method on the current thread, while start() asks the JVM to spawn a new thread, which then invokes run().
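The difference is easy to observe directly (a self-contained sketch, not from the original project):

```java
public class StartVsRun {
    // Returns {run() executed on the caller thread, start() executed on a new thread}.
    static boolean[] demo() {
        String caller = Thread.currentThread().getName();
        final String[] seen = new String[2];
        Runnable direct = () -> seen[0] = Thread.currentThread().getName();
        Runnable spawned = () -> seen[1] = Thread.currentThread().getName();

        direct.run();                      // no new thread: executes inline on the caller
        Thread t = new Thread(spawned);
        t.start();                         // the JVM spawns a thread, which calls run()
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new boolean[] { caller.equals(seen[0]), !caller.equals(seen[1]) };
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println(r[0] + " " + r[1]);  // prints "true true"
    }
}
```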
So the data is sharded by its timestamp in MongoDB and read slice by slice: each database query gets its time slice from the getStart method, which is declared synchronized so that under concurrency no two threads can ever claim the same slice.
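The slice hand-out can be sketched in isolation like this (a toy version with hypothetical numbers; the Mongo query is replaced by an in-memory set so the claiming logic is the only thing under test):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class SliceDemo {
    static final long SLICE = 10;   // stands in for timeClock
    private long nextStart;

    SliceDemo(long start) { this.nextStart = start; }

    // Same shape as getStart() in the article: hand out the next slice atomically.
    synchronized long getStart() {
        long temp = nextStart;
        nextStart += SLICE;
        return temp;
    }

    // Workers claim slices until the range [start, end) is exhausted.
    static Set<Long> claimAll(int threads, long start, long end) {
        SliceDemo d = new SliceDemo(start);
        Set<Long> claimed = ConcurrentHashMap.newKeySet();
        CountDownLatch done = new CountDownLatch(threads);
        Runnable worker = () -> {
            while (true) {
                long s = d.getStart();
                if (s >= end) break;   // past the end: nothing left to read
                claimed.add(s);        // a real worker would query [s, s + SLICE) here
            }
            done.countDown();
        };
        for (int i = 0; i < threads; i++) new Thread(worker).start();
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return claimed;
    }

    public static void main(String[] args) {
        // [0, 100) in slices of 10 -> exactly 10 distinct slice starts, no duplicates
        System.out.println(claimAll(4, 0, 100).size());  // prints 10
    }
}
```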
The following code reads data from MongoDB with multiple threads and writes it to HDFS in fixed-size chunks.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.log4j.Logger;
import org.bson.Document;

import com.mongodb.client.MongoCursor;

public class RsUserTagsRunner implements Runnable {
    static Logger log = Logger.getLogger(RsUserTagsRunner.class);
    private long end;                        // upper bound (exclusive) of the time range
    private long nextStart;                  // start of the next unclaimed slice; guarded by getStart()
    private static final long timeClock = 1000 * 60 * 60;   // one-hour slices
    private StringBuffer stringBuffer = new StringBuffer(); // shared write buffer; guarded by write()
    public static Boolean successFlag = null;               // null = running, true = ok, false = failed
    public static AtomicLong size = new AtomicLong(0L);     // rows read so far, across all threads
    private long num = 1;                    // next HDFS chunk number; guarded by getNum()
    private long CurrentThread = 0;          // number of live worker threads
    private CountDownLatch over;             // signals the caller when a worker finishes

    public RsUserTagsRunner(long start, long end, CountDownLatch over) {
        this.end = end;
        this.nextStart = start;
        this.over = over;
    }
    // Hand out the next time slice. synchronized so that no two threads
    // can ever receive the same slice.
    synchronized private long getStart() {
        long temp = nextStart;
        nextStart = nextStart + timeClock;
        return temp;
    }

    // Hand out the next HDFS chunk number.
    synchronized private long getNum() {
        return num++;
    }
    @Override
    public void run() {
        MongoCursor<Document> cursor = null;
        try {
            synchronized (this) {            // CurrentThread is shared; increment under the lock
                CurrentThread++;
            }
            log.info("##### rs_user_tags worker thread started");
            List<String> index = TableIndex.map.get("rs_user_tags");
            while (true) {
                long tempStart = getStart();
                if (tempStart < end) {
                    long tempEnd = tempStart + timeClock;
                    if (tempEnd > end) {
                        tempEnd = end;
                    }
                    log.info("##### rs_user_tags fetching slice [" + tempStart + ", " + tempEnd + ")");
                    cursor = MongoDBDao.getRsUserTags(tempStart, tempEnd);
                    while (cursor.hasNext()) {
                        Document document = cursor.next();
                        size.addAndGet(1L);
                        // Join the indexed columns with tabs; missing columns become "NULL".
                        String line = document.get(index.get(0)).toString();
                        for (int i = 1; i < index.size(); i++) {
                            Object columnOb = document.get(index.get(i));
                            String columnValue = "NULL";
                            if (columnOb != null) {
                                columnValue = columnOb.toString();
                            }
                            line = line + '\t' + columnValue;
                        }
                        write(line + '\n');
                    }
                    cursor.close();          // close each slice's cursor; the finally block only sees the last one
                    log.info("##### rs_user_tags slice [" + tempStart + ", " + tempEnd + ") done, " + size
                            + " rows fetched so far");
                } else {
                    break;
                }
            }
            end();
        } catch (Exception e) {
            successFlag = false;
            log.error("##### rs_user_tags fetch failed");
            log.error(e.getMessage(), e);
        } finally {
            if (cursor != null) {
                cursor.close();
            }
            if (successFlag != null) {
                over.countDown();
            }
        }
    }
    // Append a line to the shared buffer and flush to HDFS once the buffer
    // exceeds the flush size; an empty line forces a final flush. synchronized:
    // the append, the length check and the buffer swap must be one atomic unit.
    synchronized private void write(String line) {
        stringBuffer.append(line);
        if (stringBuffer.length() > ConstantTool.HDFS_FLUSH_SIZE || line.isEmpty()) {
            final String context = stringBuffer.toString();
            // Note: the Runnable below is invoked with run(), not start(), so the
            // upload happens synchronously on the current thread.
            new Runnable() {
                public void run() {
                    long localNum = getNum();
                    log.info("##### writing to HDFS " + ConstantTool.OUTPUT_PATH + "/rs_user_tags_inc/tmp-" + localNum);
                    try {
                        Path path = new Path(ConstantTool.OUTPUT_PATH + "/rs_user_tags_inc/tmp-" + localNum);
                        FileSystem fs = HDFSTool.getFS();
                        if (fs.exists(path)) {
                            fs.delete(path, true);
                        }
                        InputStream in = new ByteArrayInputStream(context.getBytes());
                        OutputStream out = fs.create(path);
                        IOUtils.copyBytes(in, out, context.length(), true);
                        in.close();
                        out.close();
                        if (successFlag == null) {
                            successFlag = true;
                        }
                    } catch (IOException e) {
                        successFlag = false;   // a failed upload must mark the whole job failed
                        log.error(e.getMessage(), e);
                    }
                }
            }.run();
            stringBuffer = new StringBuffer();
        }
    }
    // Called when a worker runs out of slices. The last worker to finish
    // flushes whatever is left in the buffer.
    synchronized private void end() {
        CurrentThread--;
        if (CurrentThread == 0) {
            write("");
        }
    }
}
Note the synchronized modifier on these methods. Take write() as an example: without it, the append, the length check and the buffer swap could interleave across threads, and the data written to HDFS would no longer match the data read from MongoDB.
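To make that concrete, here is the buffering pattern reduced to a toy version (illustrative names; the HDFS upload is replaced by an in-memory list of chunks). Because write() is synchronized, every appended byte ends up in exactly one flushed chunk and lines are never torn:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class ChunkedWriter {
    static final int FLUSH_SIZE = 64;               // stands in for ConstantTool.HDFS_FLUSH_SIZE
    private StringBuffer buf = new StringBuffer();
    final List<String> chunks = new ArrayList<>();  // stands in for the HDFS tmp-N files

    // append + length check + buffer swap form one critical section, as in write().
    synchronized void write(String line) {
        buf.append(line);
        if (buf.length() > FLUSH_SIZE || line.isEmpty()) {
            chunks.add(buf.toString());
            buf = new StringBuffer();
        }
    }

    static int totalFlushed(int threads, int linesPerThread) {
        ChunkedWriter w = new ChunkedWriter();
        CountDownLatch done = new CountDownLatch(threads);
        for (int i = 0; i < threads; i++) {
            new Thread(() -> {
                for (int j = 0; j < linesPerThread; j++) w.write("0123456789\n");
                done.countDown();
            }).start();
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        w.write("");                                // final empty write forces the last flush
        int total = 0;
        for (String c : w.chunks) total += c.length();
        return total;
    }

    public static void main(String[] args) {
        // 4 threads x 100 lines x 11 chars each must all arrive in the flushed chunks
        System.out.println(totalFlushed(4, 100));   // prints 4400
    }
}
```

Remove the synchronized keyword and the count no longer comes out right on every run: two threads can both see the buffer over the limit and swap it, dropping or duplicating data.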