`
michael8335
  • 浏览: 187546 次
  • 性别: Icon_minigender_1
  • 来自: 北京
社区版块
存档分类
最新评论

MongoDb Architecture

 
阅读更多

NOSQL has become a very heated topic for large web-scale deployment where scalability and semi-structured data driven the DB requirement towards NOSQL. There has been many NOSQL products evolving in over last couple years. In my past blogs, I have been covering the underlying distributed system theory of NOSQL, as well as some specific products such as CouchDB and Cassandra/HBase.

Last Friday I was very lucky to meet with Jared Rosoff from 10gen in a technical conference and have a discussion about the technical architecture of MongoDb. I found the information is very useful and want to share with more people.

One thing I am very impressed by MongoDb is that it is extremely easy to use and the underlying architecture is also very easy to understand.

Here are some simple admin steps to start/stop MongoDb server
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

# Install MongoDB

mkdir /data/lib

 

# Start Mongod server

.../bin/mongod # data stored in /data/db

 

# Start the command shell

.../bin/mongo

> show dbs

> show collections

 

# Remove collection

> db.person.drop()

 

# Stop the Mongod server from shell

> use admin

> db.shutdownServer()


Major difference from RDBMS
MongoDb differs from RDBMS in the following way
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

  • Unlike an RDBMS record which is "flat" (a fixed number of simple data type), the basic unit of MongoDb is "document" which is "nested" and can contain multi-value fields (arrays, hash).
  • Unlike RDBMS where all records stored in a table must be confined to the table schema, documents of any structure can be stored in the same collection.
  • There is no "join" operation in the query. Overall, data is encouraged to be organized in a more denormalized manner and the more burden of ensuring data consistency is pushed to the application developers
  • There is no concept of "transaction" in MongoDb. "Atomicity" is guaranteed only at the document level (no partial update of a document will occurred).
  • There is no concept of "isolation", any data read by one client may have its value modified by another concurrent client.

By removing some of those features that a classical RDBMS will provide, MongoDb can be more light-weight and be more scalable in processing big data.
Query processingMongoDb belongs to the type of document-oriented DB. In this model, data is organized as JSON document, and store into a collection. Collection can be thought for equivalent to Table and Document is equivalent to records in RDBMS world.

Here are some basic example.
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

# create a doc and save into a collection

> p = {firstname:"Dave", lastname:"Ho"}

> db.person.save(p)

> db.person.insert({firstname:"Ricky", lastname:"Ho"})

 

# Show all docs within a collection

> db.person.find()

 

# Iterate result using cursor

> var c = db.person.find()

> p1 = c.next()

> p2 = c.next()


To specify the search criteria, an example document containing the fields that needs to match against need to be provided.
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

> p3 = db.person.findone({lastname:"Ho"})

Notice that in the query, the value portion need to be determined before the query is made (in other words, it cannot be based on other attributes of the document). For example, lets say if we have a collection of "Person", it is not possible to express a query that return person whose weight is larger than 10 times of their height.
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

# Return a subset of fields (ie: projection)

> db.person.find({lastname:"Ho"}, {firstname:true})

 

# Delete some records

> db.person.remove({firstname:"Ricky"})

To speed up the query, index can be used. In MongoDb, index is stored as a BTree structure (so range query is automatically supported). Since the document itself is a tree, the index can be specified as a path and drill into deep nesting level inside the document.
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

# To build an index for a collection

> db.person.ensureIndex({firstname:1})

 

# To show all existing indexes

> db.person.getIndexes()

 

# To remove an index

> db.person.dropIndex({firstname:1})

 

# Index can be build on a path of the doc.

> db.person.ensureIndex({"address.city":1})

 

# A composite key can be used to build index

> db.person.ensureIndex({lastname:1, firstname:1})

Index can also be build on an multi-valued attribute such as an array. In this case, each element in the array will have a separate node in the BTree.

Building an index can be done in both offline foreground mode or online background mode. Foreground mode will proceed much faster but the DB cannot be access during the build index period. If the system is running in a replica set (describe below), it is recommended to rotate each member DB offline and build the index in foreground.

When there are multiple selection criteria in a query, MongoDb attempts to use one single best index to select a candidate set and then sequentially iterate through them to evaluate other criteria.

When there are multiple indexes available for a collection. When handling a query the first time, MongoDb will create multiple execution plans (one for each available index) and let them take turns (within certain number of ticks) to execute until the fastest plan finishes. The result of the fastest executor will be returned and the system remembers the corresponding index used by the fastest executor. Subsequent query will use the remembered index until certain number of updates has happened in the collection, then the system repeats the process to figure out what is the best index at that time.

Since only one index will be used, it is important to look at the search or sorting criteria of the query and build additional composite index to match the query better. Maintaining an index is not without cost as index need to be updated when docs are created, deleted and updated, which incurs overhead to the update operations. To maintain an optimal balance, we need to periodically measure the effectiveness of having an index (e.g. the read/write ratio) and delete less efficient indexes.

Storage Model
Written in C++, MongoDB uses a memory map file that directly map an on-disk data file to in-memory byte array where data access logic is implemented using pointer arithmetic. Each document collection is stored in one namespace file (which contains metadata information) as well as multiple extent data files (with an exponentially/doubling increasing size).


The data structure uses doubly-linked-list extensively. Each collection of data is organized in a linked list of extents each of which represents a contiguous disk space. Each extent points to a head/tail of another linked list of docs. Each doc contains a linked list to other documents as well as the actual data encoded in BSON format.

Data modification happens in place. In case the modification increases the size of record beyond its originally allocated space, the whole record will be moved to a bigger region with some extra padding bytes. The padding bytes is used as an growth buffer so that future expansion doesn't necessary require moving the data again. The amount of padding is dynamically adjusted per collection based on its modification statistics. On the other hand, the space occupied by the original doc will be free up. This is kept tracked by a list of free list of different size.

As we can imagine holes will be created over time as objects are created, deleted or modified, this fragmentation will hurt performance as less data is being read/write per disk I/O. Therefore, we need to run the "compact" command periodically, which copy the data to a contiguous space. This "compact" operation however is an exclusive operation and has to be done offline. Typically this is done in a replica setting by rotating each member offline one at a time to perform the compaction.

Index are implemented as BTree. Each BTree node contains a number of keys (within this node), as well as pointers to left children BTree nodes of each key.

Data update and Transaction
To update an existing doc, we can do the following
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

var p1 = db.person.findone({lastname:"Ho"})

p1["address"] = "San Jose"

db.person.save(p1)

 

# Do the same in one command

db.person.update({lastname:"Ho"},

              {$set:{address:"San Jose"}},

              false,

              true)

Write by default doesn't wait. There are various wait options that the client can specified what conditions to wait before the call returns (this can also achievable by a followup "getlasterror" call), such as where the changes is persisted on disk, or changes has been propagated to sufficient members in the replica set. MongoDb also provides a sophisticated way to assign tags to members of replica set to reflect their physical topology so that customized write policy for each collection can be made based on their reliability requirement.

In RDBMS, "Serializability" is a very fundamental concept about the net effect of concurrently executing work units is equivalent to as if these work units are arrange in some order of sequential execution (one at a time). Therefore, each client can treat as if the DB is exclusively available. The underlying implementation of DB server many use LOCKs or Multi-version to provide the isolation. However, this concept is not available in MongoDb (and many other NOSQL as well)

In MongoDb, every data you read should be treated as a snapshot of the past, which means by the time you look at it, it may have been changed in the DB. Therefore, if you are making a modification based on some previously read data, by the time you send the modification request, the condition where your modification is based on may have changed. If this is not acceptable for your application's consistency requirement, you may need to re-validate the condition at the time you request the modification (ie: a "conditional_modify" should be made).

Under this scheme, a "condition" is attached along with the modification request so that the DB server can validate the condition before applying the modification. (of course, the condition checking and modification must be atomic so no update can happen in between). In MongoDb, this can be achieved by the "findAndModify" call.
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

var account = db.bank.findone({id:1234})

var old_bal = account['balance']

var new_bal = old_bal + fund

 

# Pre-condition is specified in search criteria

db.bank.findAndModify({id:1234, balance:old_bal},

                   {$set: {balance: new_bal}})

 

# Check if the prev command successfully

var success =

   db.runCommand({getlasterror:1,j:true})

 

if (!success) {

 #retry_from_beginning

}

The concept of "transaction" is also missing in MongoDb. While MongoDb guarantee each document will be atomically modified (so no partial update will happen within a doc), but if the update modifies multiple documents, then there are no guarantee on the atomicity across documents.

Therefore, it is the application developers responsibility to implement the multi-update atomicity across multiple documents. We describe a common design pattern to achieve that. This technique is not specific to MongoDb and applicable to other NOSQL store, which can at least guarantee atomicity at the single record level.

The basic idea is to first create a separate document (called transaction) that links together all the documents that you want to modify. And then create a reverse link from each document (to be modified) back to the transaction. By carefully design the sequence of update in the documents and the transaction, we can achieve the atomicity of modifying multiple documents.


MongoDb's web site has also described
 a similar technique here (based on the same concept but the implementation is slightly different).

Replication Model
High availability is achieved in MongoDb via Replica Set, which provides data redundancy across multiple physical servers, including a single primary DB as well as multiple secondary DBs. For data consistency, all modifications (insert / update / deletes) request go to the primary DB where modification is made and asynchronously replicated to the other secondary DBs.


Within the replica set, members are interconnected with each other to exchange heartbeat message. A crashed server with missing heartbeat will be detected by other members and removed from the replica set membership. After the dead secondary recovers in future, it can rejoin the cluster by connecting to the primary to catchup the latest update since its last crashed. If the crashes happens in a lengthy period of time where the change log from the primary doesn't cover the whole crash period, then the recovered secondary need to reload the whole data from the primary (as if it was a brand new server).

In case of the primary DB crashes, a leader election protocol will be run among the remaining members to nominate the new primary, based on many factors such as the node priority, node uptime ... etc. After getting majority vote, the new primary server will take place. Notice that due to async replication, the newly elected primary DB doesn't necessary having all the latest updated from the crashed DB.

Client lib provides the API for the App to access the MongoDB server. At startup, the client lib will connect to some member (based on a seed list) of the Replica set and issue a "isMaster" command to gather the current picture of the set (who is the primary and secondaries). After that, the client lib connect to the single primary (where it will send all DB modification request) and some number of secondaries (where it will send read-only queries). The client library will periodically re-run the "isMaster" command to detect if any new members has joined the set. When an existing member in the set is crashed, connections to all existing clients will be dropped and forces a resynchronization of the latest picture.

There is also a special secondary DB called slave delay, which guarantee the data is propagated with a certain time lag with the master. This is used mainly to recover data after accidental deletion of data.

For data modification request, the client will send the request to the primary DB, by default the request will be returned once written to the primary, an optional parameter can be specified to indicate a certain number of secondaries need to receive the modification before return so the client can ensure the majority portion of members have got the request. Of course there is a tradeoff between latency and reliability.

For query request, by default the client will contact the primary which has the latest updated data. Optionally, the client can specify its willingness to read from any secondaries, and tolerate that the returned data may be outdated. This provide an opportunity to load balance the read request across all secondaries. Notice that in this case, a subsequent read following a write may not seen the update.

For read-mostly application, reading form any secondaries can be a big performance improvement. To select the fastest secondary DB member to issue query, the client driver periodically pings the members and will favor issuing the query to the one with lowest latency. Notice that read request is issued to only one node, there is no quorum read or read from multiple nodes in MongoDb.

The main purpose of Replica set is to provide data redundancy and also load balance read-request. It doesn't provide load balancing for write-request since all modification still has to go to the single master.

Another benefit of replica set is that members can be taken offline on an rotation basis to perform expensive operation such as compaction, indexing or backup, without impacting online clients using the alive members.

Sharding Model
To load balance write-request, we can use MongoDb shards. In the sharding setup, a collection can be partitioned (by a partition key) into chunks (which is a key range) and have chunks distributed across multiple shards (each shard will be a replica set). MongoDb sharding effectively provide an unlimited size for data collection, which is important for any big data scenario.

To reiterate, in the sharding model, a single partition key will be specified for each collection that is stored in the sharding cluster. The key space of the partition key is divided into contiguous key range called chunk, which is hosted by corresponding shards.


<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

# To define the partition key

db.runcommand({shardcollection: "testdb.person",

         key: {firstname:1, lastname:1}})

In the shard setting, the client lib connects to a stateless routing server "MongoS", which behaves as if the "MongoD". The routing server plays an important role in forwarding the client request to the appropriate shard server according to the request characteristics.


For insert/delete/update request containing the partition key, based on the chunk/shard mapping information (obtained from the config Server and cache locally), the route server will forward the request to the corresponding primary server hosting the chunk whose key range covers the partition key of the modified doc. Given a particular partition key, the primary server containing that chunk can be unambiguously determined.

In case of query request, the routing server will examine whether the partition key is part of the selection criteria and if so will only "route" the request to the corresponding shard (primary or secondary). However, if the partition key is not part of the selection criteria, then the routing server will "scatter" the request to every shard (pick one member of each shard) which will compute its local search, and the result will be gathered at the routing server and return to the client. When the query requires the result to be sorted, and if the partition key is involved in the sort order, the routing server will route the request sequentially to multiple shards as the client iterate the result. In case the sort involves other key which is not the partition key, the router server will scatter the request to all shards which will perform its local sort, and then merge the result at the routing server (basically a distributed merge-sort).

As data are inserted into chunk and get close to its full capacity, we need to split the chunk. The routing server can detect this situation statistically based on the number of requests it forward as well as the number of other routing server exist. Then the routing server will contact the primary server of the shard that contains the full chunk and request for a chunk split. The shard server will compute the mid point of the key range that can evenly distribute the data and then split the chunk and update the configuration server about its split point. Notice that so far there is no data movement as data is still residing in the same shard server.

On the other hand, there is another "balancer" process (running in one of the routing server) whose job is to make sure each shard carry approximately same number of chunks. When the unbalance condition is detected, the balancer will contact the busy shard to trigger a chunk migration process. This migration process happens online where the origination contacts the destination to initiate a data transfer, and data will start to be copied from the origination to the destination. This process may take some time (depends on the data volume) during which update can happen continuously at the origination. These changes will be tracked at the origination and when the copy finishes, delta will then transfer and applied to the destination. After multiple rounds of applying deltas, the migration enters the final round and the origination will halt and withhold all request coming from the routing server. After the last round of changes have been applied to the destination, the destination will update the configuration server about the new shard configuration and notify the origination shard (which is still withholding the request) to return a StaleConfigException to the Routing server, which will then re-read the latest configuration from the configServer and re-submit the previous requests. At some future point in time, data at the origination will be physically deleted.

It is possible that under a high frequency update condition, the changes being applied to the destination is unable to catch up with the update rate happen at the origination. When this situation is detected, the whole migration process will be aborted. The routing server may pick a different chunk to start the migration afterwards.

Map/Reduce Execution
MongoDb provide a Map/Reduce framework to perform parallel data processing. The concept is similar to
Hadoop Map/Reduce, but with the following small differences ...
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

  • It takes input from the query result of a collection rather than HDFS directory
  • The reduce output can be append to an existing collection rather than an empty HDFS directory

Map/Reduce in Mongo works in a slightly different way as follows
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

  1. Client define a map function, reduce function, query that scope the input data, and an output collection that store the output result.
  2. Client send the request to the MongoS routing server
  3. MongoS forward the request to the appropriated shards (route or scatter depends on whether partition key appears in the query). Notice that MongoS will pick one member of each shard, currently always send to the primaryDB
  4. Primary DB of each shard executes the query and pipe output to the user-defined map function, which emit a bunch of key value pairs stored in memory buffer. When the memory buffer is full, a user-defined reducer function will be invoked that partially reduce the key values pairs in the memory buffer, result stored on the local collection.
  5. When step (4) completes, the reduce function will be executed on all the previous partially reduced result to merge a single reduced result on this server.
  6. When step (5) finishes, MongoS notifies the corresponding shard servers that will store the output collection. (if the output collection is non-partitioned, only a single shard will be notified, otherwise all shards will be notified).
  7. The primary db of the shard(s) storing the final collection will call for every shard to collect the partially reduced data previously done. It will only ask for the result based on its corresponding key range.
  8. The primary db run the reduce() function again on the list of partially reduced result. Then store the final reduced result locally. If the user provide a finalize function, it will be invoked as well.

Here is a simple example to build an inverted index from document to topics
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

db.book.insert({title:"NOSQL",

             about:["software", "db"]})

 

db.book.insert({title:"Java programming",

             about:["software", "program"]})

 

db.book.insert({title:"Mongo",

             about:["db", "technology"]})

 

db.book.insert({title:"Oracle",

             about:["db", "software"]})

 

db.book.find()

 

m = function() {

 for (var i in this.about) {

     emit(this.about[i], this.title)

 }

}

 

r = function(k, vals) {

 return({topic:k, title:vals})

}

 

db.book.mapReduce(m, r, {query:{},

               out:{replace:"mroutput"}})

 

db.mroutput.find()

Overall speaking, I found MongoDb is very powerful and easy to use. I look forward to use MongoDb with Node.js and will share my experience in future blogs.

<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

Especially thanks to Jared Rosoff who provides me a lot of details of how MongoDb is implemented. 

转自【http://horicky.blogspot.jp/2012/04/mongodb-architecture.html

 

分享到:
评论

相关推荐

    mongodb architecture

    MongoDB作为一种流行的NoSQL数据库管理系统,以其高性能、高可用性以及易于扩展的特性获得了市场的广泛认可。下面从MongoDB架构的角度,深入分析其关键知识点: 1. MongoDB架构的优势:在最新版的MongoDB架构指南中...

    MongoDB_Architecture_Guide

    MongoDB架构指南向我们介绍了一个非关系型数据库系统MongoDB的设计原理和管理方式。作为数据库领域的重要组成部分,MongoDB的设计和实现代表了大数据时代对于存储解决方案的新要求。 首先,MongoDB的架构设计与传统...

    MongoDB Architecture Guide

    MongoDB架构指南是一份介绍MongoDB数据库系统架构特点的详细文档,该文档强调MongoDB的灵活性、高性能和易用性,以及如何适应现代应用构建和运维的需要。Eliot Horowitz,MongoDB的联合创始人和CTO,在引言部分阐述...

    MongoDB_Architecture_Guide(MongoDB架构指导)(个人翻译版)

    MongoDB是一种流行的开源文档数据库系统,以其灵活性、可扩展性和高性能而受到...提供的“MongoDB_Architecture_Guide.pdf”和“MongoDB.docx”文件会进一步详细介绍这些内容,帮助学习者深入掌握MongoDB的相关知识。

    MongoDB_Architecture_Guide 3.2

    ### MongoDB 架构指南 3.2 知识点概览 #### 一、引言与设计理念 **MongoDB 的设计理念**:MongoDB 不是在实验室里设计出来的,而是基于其创建者在构建大规模、高可用性系统过程中的实际经验而诞生的。其核心在于...

    mongodb-architecture-guide-july-2018

    综上所述,MongoDB-architecture-guide-july-2018涵盖了MongoDB的核心组件“mongod”的工作原理、复制集和分片等高级特性,以及MongoDB与Python的交互。这份指南对于希望深入了解MongoDB架构并优化其性能的开发人员...

    Practical MongoDB(Apress,2015)

    Practical Guide to MongoDB covers the data model, underlying architecture, coding with Mongo Shell, and administrating the MongoDB platform, among other topics. The book also provides clear guidelines...

    php配置mongodb扩展、安装mongodb服务教程

    1. 查看 PHP 版本:首先,需要查看 PHP 的版本号和 Architecture 项(x86 或 x64),以及 PHP Extension Build 项(支持 TS 或 NTS)。 2. 下载 MongoDB 扩展文件:访问 PECL 官方网站,下载适合自己 PHP 版本的 ...

    为了安装 MongoDB,应该选择适用于 ARM 架构的版本

    在 “Architecture” 部分,选择 “ARM 64 (aarch64)” 作为您的系统架构。 根据您的需求(例如,是否需要服务端和客户端二进制文件),下载相应的压缩文件。 下载完成后,解压文件并按照 MongoDB 安装指南进行安装...

    mongodb分片设计

    分片集群架构(Sharded Cluster Architectures)通常分为生产环境架构(Production Cluster Architecture)和测试环境架构(Sharded Cluster Test Architecture)。 在分片集群的部署和维护方面,文档提供了详细的...

    Practical MongoDB 无水印破地方 0分

    They cover the data model, underlying architecture, how to code using Mongo Shell, and administration of the MongoDB platform, among other topics. The book also provides clear guidelines and ...

    MongoDB in Action, 2nd Edition

    Its schema-free design encourages rapid application development, and built-in replication and auto-sharding architecture allow for massive parallel distribution. Production deployments at SourceForge...

    MongoDB分布式集群搭建

    [Replica Set 架构图](https://example.com/images/replica_set_architecture.png) 4. **测试环境配置** - 使用三台服务器(CentOS 7.8),分别为: - **192.168.56.200** (Master) - **192.168.56.201** (Slave) ...

    springboot 集成seata MongoDB xxxl-job

    Seata(Simple Extensible Autonomous Transaction Architecture)是阿里巴巴开源的分布式事务解决方案,它提供了AT、TCC、SAGA和XA四种事务模式,用于解决微服务架构下的分布式事务问题。Seata的主要特点是高性能、...

    node.js-clean-architecture:Node.js中Clean Architecture的用例,其中包括Express.js,MongoDB和Redis作为主要(但可替换)基础架构

    Node.js中的Clean Architecture用例,其中包括Express.js,MongoDB和Redis作为主要(但可替换)基础架构。 概述 此示例是一个简单的RESTful API应用程序,用户可以在其中使用Clean Architecture创建/更新/删除/查找...

    Loggio:Java Bases Log Framework with Elastic Search Distributed Architecture & MongoDB Map Reduce based Recommendation Logging

    Java Bases Log Framework with Elastic Search Distributed Architecture & MongoDB Map Reduce based Recommendation Logging 建筑学 建筑学 弹性搜索 MongoDB 和 Map Reduce 部署说明 要快速使用 Loggio,您必须...

Global site tag (gtag.js) - Google Analytics