NoSQL Databases: A First Look at MongoDB

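With the NoSQL trend running hot these days, I spent some time learning MongoDB and am recording my notes here, in the hope of exploring it together with anyone who is interested!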

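MongoDB Overview

MongoDB is written in C++. Its name comes from the middle of the word "humongous", and it is developed and maintained by 10gen. Its most concise description is: scalable, high-performance, open source, schema-free, document-oriented database. MongoDB's main goal is to bridge key/value stores (which offer high performance and high scalability) and traditional RDBMS systems (which offer rich functionality), combining the strengths of both.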

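MongoDB features:

- Document-oriented storage
- Full index support, extending to inner objects and embedded arrays
- Replication and high availability
- Auto-sharding for cloud-level scalability
- Query profiling
- Dynamic queries
- Fast, in-place updates
- Map/Reduce support
- GridFS file system
- Commercial support, training, and consulting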

Official site: http://www.mongodb.org/

Configuration

Master-slave mode

Machine	IP	Role
test001	192.168.1.1	master
test002	192.168.1.2	slave
test003	192.168.1.3	slave
test004	192.168.1.4	slave
test005	192.168.1.5	slave
test006	192.168.1.6	slave

Start the master:

        ./mongod -dbpath=/mongodb/data/ -logpath=/mongodb/logs/mongodb.log -oplogSize=10000 -logappend -master -port=27017 -fork

Add the repl user:

        ./mongo
        > use local
        > db.addUser('repl','replication');

Start the slaves:

        ./mongod -dbpath=/mongodb/data/ -logpath=/mongodb/logs/mongodb.log -slave -port=27017 -source=test001:27017 --autoresync -fork

Add the repl user on each slave as well:

        ./mongo
        > use local
        > db.addUser('repl','replication');

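The --autoresync option automatically restarts replication when an unexpected failure leaves the master and a slave out of sync (a full resync is performed at most once within 10 minutes). In addition, --slavedelay sets the delay, in seconds, that a slave waits before applying the master's operations.

We typically use the master-slave setup to separate reads from writes, but this requires setting slaveOk.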

slaveOk

When querying a replica pair or replica set, drivers route their requests to the master mongod by default; to perform a query against an (arbitrarily-selected) slave, the query can be run with the slaveOk option. Here’s how to do so in the shell:

db.getMongo().setSlaveOk(); // enable querying a slave
db.users.find(...)

Note: some language drivers permit specifying the slaveOk option on each find(), others make this a connection-wide setting. See your language’s driver for details.
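
As a rough illustration of the routing rule described above, here is a minimal sketch in plain JavaScript. It is a simplified model, not driver code: `pickReadTarget`, the `role` field, and the injected `random` function are all hypothetical names introduced for this example.

```javascript
// Simplified model of read routing; pickReadTarget is a hypothetical helper,
// not a function of any MongoDB driver.
function pickReadTarget(nodes, slaveOk, random) {
  const master = nodes.find((n) => n.role === 'master');
  const slaves = nodes.filter((n) => n.role === 'slave');
  if (!slaveOk || slaves.length === 0) {
    return master; // default: every request goes to the master
  }
  // With slaveOk set, an arbitrarily selected slave serves the query.
  const index = Math.floor(random() * slaves.length);
  return slaves[index];
}

const nodes = [
  { host: 'test001:27017', role: 'master' },
  { host: 'test002:27017', role: 'slave' },
  { host: 'test003:27017', role: 'slave' },
];

console.log(pickReadTarget(nodes, false, Math.random).host); // always the master
console.log(pickReadTarget(nodes, true, Math.random).host);  // one of the slaves
```

Passing the random source as a parameter is only a convenience for testing the sketch; a real driver makes this choice internally.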

Replica Set mode

Replica Sets use n mongod nodes to build a high-availability solution with automatic failover and automatic recovery.

Machine	IP	Role
test001	192.168.1.1	secondary
test002	192.168.1.2	secondary
test003	192.168.1.3	primary
test004	192.168.1.4	secondary
test005	192.168.1.5	secondary
test006	192.168.1.6	secondary
test007	192.168.1.7	secondary

Start:

        ./mongod -dbpath=/mongodb/data/ -logpath=/mongodb/logs/mongodb.log -oplogSize=10000 -logappend -replSet set1 -port=27017 -fork --rest

Add the repl user:

        ./mongo
        > use local
        > db.addUser('repl','replication');

Configure:

        config={_id:'set1',members:[
            {_id:0,host:'test001:27017'},
            {_id:1,host:'test002:27017'},
            {_id:2,host:'test003:27017'},
            {_id:3,host:'test004:27017'},
            {_id:4,host:'test005:27017'},
            {_id:5,host:'test006:27017'},
            {_id:6,host:'test007:27017'}]
        }
        rs.initiate(config);
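
Since the member list follows a regular pattern, the config object can also be generated instead of typed by hand. The following is a minimal sketch in plain JavaScript; `buildReplSetConfig` is a hypothetical helper written for this example and is not part of the mongo shell.

```javascript
// Build a replica set config object from a set name and a list of host:port strings.
// buildReplSetConfig is a hypothetical helper, not a mongo shell function.
function buildReplSetConfig(setName, hosts) {
  return {
    _id: setName,
    // Member _id values are assigned sequentially, starting from 0.
    members: hosts.map((host, i) => ({ _id: i, host: host })),
  };
}

// The seven hosts from the table above.
const hosts = [];
for (let n = 1; n <= 7; n++) {
  hosts.push('test00' + n + ':27017');
}

const config = buildReplSetConfig('set1', hosts);
console.log(JSON.stringify(config, null, 2));
```

In the mongo shell, the resulting object would be passed to rs.initiate(config) exactly like the hand-written one.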

Check the status:

Visit http://test001:28017/_replSet

or

        ./mongo
        > rs.status()
        {
            "set" : "set1",
            "date" : "Fri Dec 03 2010 00:57:44 GMT+0800 (CST)",
            "myState" : 2,
            "members" : [
                {
                    "_id" : 0,
                    "name" : "test001:27017",
                    "health" : 1,
                    "state" : 2,
                    "self" : true
                },
                {
                    "_id" : 1,
                    "name" : "test002:27017",
                    "health" : 1,
                    "state" : 2,
                    "uptime" : 194451,
                    "lastHeartbeat" : "Fri Dec 03 2010 00:57:42 GMT+0800 (CST)"
                },
                {
                    "_id" : 2,
                    "name" : "test003:27017",
                    "health" : 1,
                    "state" : 1,
                    "uptime" : 194689,
                    "lastHeartbeat" : "Fri Dec 03 2010 00:57:43 GMT+0800 (CST)"
                },
                {
                    "_id" : 3,
                    "name" : "test004:27017",
                    "health" : 1,
                    "state" : 2,
                    "uptime" : 194689,
                    "lastHeartbeat" : "Fri Dec 03 2010 00:57:42 GMT+0800 (CST)"
                },
                {
                    "_id" : 4,
                    "name" : "test005:27017",
                    "health" : 1,
                    "state" : 2,
                    "uptime" : 194689,
                    "lastHeartbeat" : "Fri Dec 03 2010 00:57:42 GMT+0800 (CST)"
                },
                {
                    "_id" : 5,
                    "name" : "test006:27017",
                    "health" : 1,
                    "state" : 2,
                    "uptime" : 194689,
                    "lastHeartbeat" : "Fri Dec 03 2010 00:57:43 GMT+0800 (CST)"
                },
                {
                    "_id" : 6,
                    "name" : "test007:27017",
                    "health" : 1,
                    "state" : 2,
                    "uptime" : 194689,
                    "lastHeartbeat" : "Fri Dec 03 2010 00:57:42 GMT+0800 (CST)"
                }
            ],
            "ok" : 1
        }

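After performing an operation on a Replica Set, call getlasterror so that the write returns only once it has been replicated to at least 3 machines:

db.runCommand( { getlasterror : 1 , w : 3 } )

Note: this mode does not support auth. If you need auth, use master-slave mode.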

Sharding mode

Building a MongoDB Sharding Cluster requires three roles:

Shard Server: a mongod instance that stores the actual data chunks.
Config Server: a mongod instance that stores the metadata for the whole cluster, including chunk information.
Route Server: a mongos instance that acts as the front-end router; clients connect through it, and it makes the whole cluster look like a single database process.

Machine	IP	Role
test002	192.168.1.2	mongod shard11:27017
test003	192.168.1.3	mongod shard21:27017
test004	192.168.1.4	mongod shard31:27017
test005	192.168.1.5	mongod config1:20000 mongos1:30000
test006	192.168.1.6	mongod config2:20000 mongos2:30000
test007	192.168.1.7	mongod config3:20000 mongos3:30000
test008	192.168.1.8	mongod shard12:27017
test009	192.168.1.9	mongod shard22:27017
test010	192.168.1.10	mongod shard32:27017

Shard configuration

Shard1

[test002; test008]

test002:

 
        ./mongod -shardsvr -replSet shard1 -port 27017 -dbpath /mongodb/data/shard11 -oplogSize 10000 -logpath /mongodb/logs/shard11.log -logappend -fork

test008:

 
        ./mongod -shardsvr -replSet shard1 -port 27017 -dbpath /mongodb/data/shard12 -oplogSize 10000 -logpath /mongodb/logs/shard12.log -logappend -fork

Initialize shard1

        config={_id:'shard1',members:[
            {_id:0,host:'test002:27017'},
            {_id:1,host:'test008:27017'}]
        }
        rs.initiate(config);

Shard2

[test003; test009]

test003:

 
        ./mongod -shardsvr -replSet shard2 -port 27017 -dbpath /mongodb/data/shard21 -oplogSize 10000 -logpath /mongodb/logs/shard21.log -logappend -fork

test009:

 
        ./mongod -shardsvr -replSet shard2 -port 27017 -dbpath /mongodb/data/shard22 -oplogSize 10000 -logpath /mongodb/logs/shard22.log -logappend -fork

Initialize shard2

        config={_id:'shard2',members:[
            {_id:0,host:'test003:27017'},
            {_id:1,host:'test009:27017'}]
        }
        rs.initiate(config);

Shard3

[test004; test010]

test004:

 
        ./mongod -shardsvr -replSet shard3 -port 27017 -dbpath /mongodb/data/shard31 -oplogSize 10000 -logpath /mongodb/logs/shard31.log -logappend -fork

test010:

 
        ./mongod -shardsvr -replSet shard3 -port 27017 -dbpath /mongodb/data/shard32 -oplogSize 10000 -logpath /mongodb/logs/shard32.log -logappend -fork

Initialize shard3

        config={_id:'shard3',members:[
            {_id:0,host:'test004:27017'},
            {_id:1,host:'test010:27017'}]
        }
        rs.initiate(config);

Config server configuration

[test005; test006; test007]

 
        ./mongod -configsvr -dbpath /mongodb/data/config -port 20000 -logpath /mongodb/logs/config.log -logappend -fork

mongos configuration

[test005; test006; test007]

 
        ./mongos -configdb test005:20000,test006:20000,test007:20000 -port 30000 -chunkSize 5 -logpath /mongodb/logs/mongos.log -logappend -fork

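The router forwards each request to the actual target server process and merges multiple results before returning them to the client. The router itself stores no data or state; it only fetches information from the Config Servers at startup. Any change on the Config Servers is propagated to all router processes.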
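
To make the routing idea more concrete, here is a minimal sketch in plain JavaScript of how a router could map a shard key to a shard using chunk metadata. It is a simplified model under stated assumptions: the chunk boundaries are made-up values, and `routeByKey` and `shardsForRange` are hypothetical helpers, not mongos internals. The shard names s1, s2, and s3 match the names used when the shards are added in this post.

```javascript
// Simplified model: each chunk covers the key range [min, max) and lives on one shard.
// The boundaries below are made-up values for illustration only.
const chunks = [
  { min: -Infinity, max: 1000, shard: 's1' },
  { min: 1000, max: 2000, shard: 's2' },
  { min: 2000, max: Infinity, shard: 's3' },
];

// Find the shard that owns a single shard-key value (hypothetical helper).
function routeByKey(chunkTable, key) {
  const chunk = chunkTable.find((c) => key >= c.min && key < c.max);
  return chunk ? chunk.shard : null;
}

// A query over the key range [low, high) may touch several shards;
// the router would send it to each of them and merge the partial results.
function shardsForRange(chunkTable, low, high) {
  return chunkTable
    .filter((c) => c.min < high && c.max > low)
    .map((c) => c.shard);
}

console.log(routeByKey(chunks, 16));
console.log(shardsForRange(chunks, 500, 1500));
```

A lookup by a single key touches one shard, while a range that spans chunk boundaries fans out to several; that fan-out and merge step is the work the router does on behalf of the client.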

Configuring the Shard Cluster

1. Connect to the admin database

        ./mongo test005:30000/admin

2. Add the shards

        db.runCommand({addshard:"shard1/test002:27017,test008:27017",name:"s1",maxsize:20480});
        db.runCommand({addshard:"shard2/test003:27017,test009:27017",name:"s2",maxsize:20480});
        db.runCommand({addshard:"shard3/test004:27017,test010:27017",name:"s3",maxsize:20480});

3. List the shards

        db.runCommand({listshards:1})

If the three shards above are listed, the shards have been configured successfully.

4. Enable sharding for the database and its collections

        db.runCommand({enablesharding:"taobao"});
        db.runCommand({shardcollection:"taobao.test0",key:{_id:1}});
        db.runCommand({shardcollection:"taobao.test1",key:{_id:1}});

Usage

Operating the database from the shell

Superuser-related:

1) Switch to the admin database

        use admin

2) Add a user or change a user's password

        db.addUser('name','pwd')

3) List the users

        db.system.users.find()

4) Authenticate a user

        db.auth('name','pwd')

5) Remove a user

        db.removeUser('name')

6) Show all users

        show users

7) Show all databases

        show dbs

8) Show all collections

        show collections

9) Show the stats of each collection

        db.printCollectionStats()

10) Show the master-slave replication status

        db.printReplicationInfo()

11) Repair the database

        db.repairDatabase()

12) Set the profiling level: 0=off, 1=slow, 2=all

        db.setProfilingLevel(1)

13) View the profiling data

        show profile

14) Copy a database

        db.copyDatabase('mail_addr','mail_addr_tmp')

15) Drop a collection

        db.mail_addr.drop()

16) Drop the current database

        db.dropDatabase()

Insert, delete, and update:

1) Insert

        db.user.insert({'name':'dump','age':1})

or

        db.user.save({'name':'dump','age':1})

Nested objects:

        db.foo.save({'name':'dump','address':{'city':'hangzhou','post':310015},'phone':[138888888,13999999999]})

Arrays:

        db.user_addr.save({'Uid':'dump','Al':['[email protected]','[email protected]']})

2) Delete

Delete the user records where name='dump':

        db.user.remove({'name':'dump'})

Delete every record in the foo collection:

        db.foo.remove()

3) Update

// update foo set xx=4 where yy=6
// insert if no match exists (upsert); multiple records may be modified (multi)

        db.foo.update({'yy':6},{'$set':{'xx':4}},true,true) // 3rd argument: upsert=true, 4th argument: multi=true
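
The effect of the upsert and multi flags can be sketched with an in-memory model in plain JavaScript. This is only an illustration under simplifying assumptions: it handles top-level equality queries and the $set modifier, ignores _id generation, and `matches` and `update` are hypothetical functions rather than shell or driver calls.

```javascript
// In-memory sketch of update(query, {$set: fields}, upsert, multi).
// Only top-level equality queries and $set are modeled.
function matches(doc, query) {
  return Object.keys(query).every((k) => doc[k] === query[k]);
}

function update(docs, query, setFields, upsert, multi) {
  let modified = 0;
  for (const doc of docs) {
    if (!matches(doc, query)) continue;
    Object.assign(doc, setFields); // apply $set to the matching document
    modified++;
    if (!multi) break; // without multi, only the first match is changed
  }
  if (modified === 0 && upsert) {
    // Nothing matched: insert a new document built from the query plus the $set fields.
    docs.push(Object.assign({}, query, setFields));
  }
  return modified;
}

const foo = [{ yy: 6, xx: 1 }, { yy: 6, xx: 2 }, { yy: 7, xx: 3 }];
update(foo, { yy: 6 }, { xx: 4 }, true, true); // both yy=6 documents get xx=4
update(foo, { yy: 9 }, { xx: 4 }, true, true); // no match, so a new document is inserted
console.log(JSON.stringify(foo));
```

With multi set to false, the same call would stop after the first matching document, which is the default behavior of update.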

Queries:

        coll.find() // select * from coll
        coll.find().limit(10) // select * from coll limit 10
        coll.find().sort({x:1}) // select * from coll order by x asc
        coll.find().sort({x:1}).skip(5).limit(10) // select * from coll order by x asc limit 5, 10
        coll.find({x:10}) // select * from coll where x = 10
        coll.find({x: {$lt:10}}) // select * from coll where x < 10
        coll.find({}, {y:true}) // select y from coll
        coll.count() // select count(*) from coll
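
For readers who think in terms of arrays rather than SQL, the same query shapes can be mimicked on an in-memory array in plain JavaScript. This is a sketch for intuition only; the sample data is made up and none of this is MongoDB API.

```javascript
// Sample data standing in for a collection; the values are made up.
const rows = [];
for (let i = 0; i < 30; i++) {
  rows.push({ x: 29 - i, y: 'row' + i });
}

// find().sort({x:1}).skip(5).limit(10): sort ascending, drop 5, keep the next 10.
const page = rows
  .slice() // copy, so the source array is not reordered
  .sort((a, b) => a.x - b.x)
  .slice(5, 5 + 10);

// find({x: {$lt: 10}}): strictly less than 10.
const small = rows.filter((doc) => doc.x < 10);

// find({}, {y: true}): keep only the y field (the real shell also returns _id).
const onlyY = rows.map((doc) => ({ y: doc.y }));

console.log(page.map((doc) => doc.x));
console.log(small.length, Object.keys(onlyY[0]));
```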

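Other:

        coll.find({"address.city":"gz"}) // match documents whose embedded address document has city equal to gz
        coll.find({likes:"math"}) // search inside an array
        coll.find({name: {$exists: true}}); // all documents that have a name field
        coll.find({phone: {$exists: false}}); // all documents that have no phone field
        coll.find({name: {$type: 2}}); // all documents whose name field is a string
        coll.find({age: {$type: 16}}); // all documents whose age field is a 32-bit integer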
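
The dot notation, array matching, and $exists behavior above can be sketched in plain JavaScript as follows. This is a simplified model for intuition: `getPath`, `fieldMatches`, and `fieldExists` are hypothetical helpers, and the second sample document is invented for the example.

```javascript
// Resolve a dotted path such as "address.city" inside a nested document.
function getPath(doc, path) {
  return path.split('.').reduce(
    (value, key) => (value == null ? undefined : value[key]),
    doc
  );
}

// Equality match in the style of find({field: value}):
// if the stored value is an array, any element may match.
function fieldMatches(doc, path, expected) {
  const actual = getPath(doc, path);
  if (Array.isArray(actual)) {
    return actual.includes(expected);
  }
  return actual === expected;
}

// $exists: true / false asks whether the field is present at all.
function fieldExists(doc, path) {
  return getPath(doc, path) !== undefined;
}

const people = [
  { name: 'dump', address: { city: 'hangzhou', post: 310015 }, likes: ['math', 'go'] },
  { name: 'other', address: { city: 'gz' }, phone: [138888888] },
];

console.log(people.filter((d) => fieldMatches(d, 'address.city', 'gz')).map((d) => d.name));
console.log(people.filter((d) => fieldMatches(d, 'likes', 'math')).map((d) => d.name));
console.log(people.filter((d) => !fieldExists(d, 'phone')).map((d) => d.name));
```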

Indexes:

1 (ascending), -1 (descending)

        coll.ensureIndex({productid:1}) // ordinary index on productid
        coll.ensureIndex({district:1, plate:1}) // compound (multi-field) index
        coll.ensureIndex({"address.city":1}) // index on a field of an embedded document
        coll.ensureIndex({productid:1}, {unique:true}) // unique index
        coll.ensureIndex({productid:1}, {unique:true, dropDups:true}) // while building the index, delete documents whose indexed value has already appeared
        coll.getIndexes() // list the indexes
        coll.dropIndex({productid:1}) // drop a single index
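
The effect of dropDups on existing data can be pictured with a small in-memory sketch in plain JavaScript. It assumes that the first document seen for each key value is kept and later ones are dropped; `dropDuplicates` is a hypothetical helper and the sample documents are made up.

```javascript
// Keep the first document seen for each value of the indexed field
// and drop later duplicates (hypothetical helper, not MongoDB API).
function dropDuplicates(docs, field) {
  const seen = new Set();
  const kept = [];
  for (const doc of docs) {
    if (seen.has(doc[field])) continue; // later duplicate: dropped
    seen.add(doc[field]);
    kept.push(doc);
  }
  return kept;
}

const products = [
  { productid: 1, title: 'a' },
  { productid: 2, title: 'b' },
  { productid: 1, title: 'c' },
];

const unique = dropDuplicates(products, 'productid');
console.log(JSON.stringify(unique));
```

Because the dropped documents are deleted for good, it is worth checking for duplicates before building such an index on real data.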

MongoDB Drivers

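MongoDB offers client APIs for a great many programming languages. Because the dump center is built on Hadoop, this section focuses on the Java API; the test programs later in this post also use the Java API.

MongoDB in Java

Download the MongoDB Java driver and drop the jar (mongo-2.3.jar) into your project.

In Java, the Mongo object is thread-safe, and an application should use only one Mongo instance. The Mongo object maintains a connection pool automatically; the default number of connections is 10.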

 
        import com.mongodb.*;

        try {
            // server_lists is a List<ServerAddress>
            Mongo mg = new Mongo(server_lists);
            DB db = mg.getDB("taobao");
            if (!db.isAuthenticated()) {
                db.authenticate("name", "pwd".toCharArray());
            }
            DBCollection coll = db.getCollection("category_property_values");
            coll.slaveOk(); // required in repl set mode; otherwise every query is sent to the primary only

            // insert
            BasicDBObject doc = new BasicDBObject();
            // set the field values
            doc.put("name", "MongoDB");
            doc.put("type", "database");
            coll.insert(doc);
            // ……

            // select
            // fetch a single document
            BasicDBObject cond = new BasicDBObject();
            cond.put("name", "MongoDB");
            DBObject result = coll.findOne(cond);
            // ……

            // query with a cursor
            DBCursor cur = coll.find(cond);
            while (cur.hasNext()) {
                cur.next();
                // ……
            }
            // ……

            // update (j and str are defined elsewhere in the program)
            DBObject dblist = new BasicDBObject();
            DBObject qlist = new BasicDBObject();
            qlist.put("_id", j);
            dblist.put("t1", str);
            coll.update(qlist, dblist); // replaces the matched document with dblist
            // ……

            // delete
            DBObject dlist = new BasicDBObject();
            dlist.put("_id", j);
            coll.remove(dlist);
        } catch (MongoException ex) {
            // error handling left empty in the original
        }

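MongoDB tests

Version tested: 1.6.3

A single thread inserted 1 million, 3 million, 5 million, and 10 million documents; the multi-threaded runs inserted 1 million documents per thread.

Format of the inserted data: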

 
        { "_id" : NumberLong(16), "nid" : NumberLong(16), "t1" : "search_engine_insert", "t2" : "search_engine_insert", "t3" : "search_engine_insert", "t4" : "search_engine_insert" }

1) Master-slave mode

Insert

Rows per thread	Run time (s)	Inserts/s per thread	Total inserts/s	Total rows	Threads
1000000	20	50000	50000	1000000	1
3000000	60	50000	50000	3000000	1
5000000	99	50505	50505	5000000	1
8000000	159	50314	50314	8000000	1
10000000	208	48076	48076	10000000	1
1000000	64	15625	31250	2000000	2

In MongoDB, only the master node can perform inserts and updates.
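
The rate columns in the result tables appear to be derived by dividing the row counts by the run time in seconds and truncating to an integer. A minimal sketch of that arithmetic in plain JavaScript, with `throughput` as a hypothetical helper name:

```javascript
// Derive the rate columns from rows per thread, run time in seconds, and thread count.
// The truncation to an integer is an assumption inferred from the table values.
function throughput(rowsPerThread, runTimeSeconds, threads) {
  const totalRows = rowsPerThread * threads;
  return {
    perThread: Math.floor(rowsPerThread / runTimeSeconds),
    total: Math.floor(totalRows / runTimeSeconds),
    totalRows: totalRows,
  };
}

// The two-thread insert run from the table above: 1,000,000 rows per thread in 64 s.
const twoThreads = throughput(1000000, 64, 2);
console.log(twoThreads);
```

Read this way, the two-thread run gives each thread a much lower rate than a single thread gets alone, and the combined rate also ends up below the single-thread figure.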

Update

Data format:

 
        { "_id" : NumberLong(16), "nid" : NumberLong(16), "t1" : "search_engine_update", "t2" : "search_engine_update", "t3" : "search_engine_update", "t4" : "search_engine_update" }

Rows per thread	Run time (s)	Updates/s per thread	Total updates/s	Total rows	Threads
1000000	96	10416	10416	1000000	1
3000000	287	10452	10452	3000000	1
1000000	188	5319	15957	3000000	3
1000000	351	2849	14245	5000000	5

Select

Look up by the "_id" field as the key and return the whole document.

a) Client: single machine, multiple threads

Rows per thread	Run time (s)	Selects/s per thread	Total selects/s	Total rows	Threads
1000000	72	13888	13888	1000000	1
1000000	129	7751	77519	10000000	10
1000000	554	1805	90252	50000000	50
1000000	1121	892	89206	100000000	100
1000000	2256	443	88652	200000000	200

b) Client: distributed, multiple threads

The program was deployed on 39 machines.

Rows per thread	Run time (s)	Selects/s per thread	Total selects/s (per machine*39)	Total rows	Threads
1000000	173	5780	5780*39=223470	1000000*39	1
1000000	1402	713	7132*39=278148	10000000*39	10
500000	1406	355	7112*39=277368	10000000*39	20
200000	1433	139	6978*39=272142	10000000*39	50

2) Replica Set mode

Insert

Rows per thread	Run time (s)	Inserts/s per thread	Total inserts/s	Total rows	Threads
1000000	40	25000	25000	1000000	1
3000000	117	25641	25641	3000000	1
5000000	211	23696	23696	5000000	1
8000000	289	27681	27681	8000000	1
10000000	388	25773	25773	10000000	1
1000000	83	12048	24096	2000000	2
1000000	210	4762	23809	5000000	5

Update

Rows per thread	Run time (s)	Updates/s per thread	Total updates/s	Total rows	Threads
1000000	28	35714	35714	1000000	1
3000000	83	36144	36144	3000000	1
1000000	146	6849	20547	3000000	3
1000000	262	3816	19083	5000000	5

Select

Look up by the "_id" field as the key and return the whole document.

a) Client: single machine, multiple threads

Rows per thread	Run time (s)	Selects/s per thread	Total selects/s	Total rows	Threads
1000000	198	5050	5050	1000000	1
1000000	264	3787	37878	10000000	10
1000000	436	2293	114678	50000000	50
1000000	754	1326	132625	100000000	100
1000000	1526	655	131061	200000000	200

b) Client: distributed, multiple threads

The program was deployed on 39 machines.

Rows per thread	Run time (s)	Selects/s per thread	Total selects/s (per machine*39)	Total rows	Threads
1000000	216	4629	4629*39=180531	1000000*39	1
1000000	1375	729	7293*39=284427	10000000*39	10
500000	1469	340	6807*39=265473	10000000*39	20
200000	1561	128	6406*39=249834	10000000*39	50

3) Sharding mode

Insert

Rows per thread	Run time (s)	Inserts/s per thread	Total inserts/s	Total rows	Threads
1000000	58	17241	17241	1000000	1
3000000	180	16666	16666	3000000	1
5000000	373	13404	13404	5000000	1
2000000	234	8547	17094	4000000	2
2000000	447	4474	22371	10000000	5

Update

Rows per thread	Run time (s)	Updates/s per thread	Total updates/s	Total rows	Threads
1000000	38	26315	26315	1000000	1
3000000	115	26086	26086	3000000	1
1000000	64	15625	46875	3000000	3
1000000	93	10752	53763	5000000	5

Select

Look up by the "_id" field as the key and return the whole document.

a) Client: single machine, multiple threads

Rows per thread	Run time (s)	Selects/s per thread	Total selects/s	Total rows	Threads
1000000	277	3610	3610	1000000	1
1000000	456	2192	21929	10000000	10
1000000	1158	863	43177	50000000	50
1000000	2299	434	43497	100000000	100

b) Client: distributed, multiple threads

The program was deployed on 39 machines.

Rows per thread	Run time (s)	Selects/s per thread	Total selects/s (per machine*39)	Total rows	Threads
1000000	659	1517	1517*39=59163	1000000*39	1
1000000	8540	117	1170*39=45630	10000000*39	10

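Summary:

MongoDB's query performance in M-S and Repl-Set modes is quite good. The difference is that in Repl-Set mode, if the primary node goes down, the system elects another primary by itself and subsequent use is unaffected; when the original primary recovers, it automatically becomes a secondary. In M-S mode, once the master goes down, one of the slaves has to be promoted to master by hand. In addition, a Repl-Set can have at most 7 nodes.

Query speed dropped noticeably in sharding mode and the runs took too long, so I only ran two rounds; its strength probably shows only with very large data volumes. The figures above are for reference only. This was just a simple test. Next I will study the source code, and I welcome discussion with anyone who is interested!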