Schema design in MongoDB is very different than schema design in a relational DBMS. However it is still very important and the first step towards building an application.
In relational data models, conceptual there is a "correct" design for a given entity relationship model independent of the use case. This is typically a third normal form normalization. One typically only diverges from this for performance reasons. In MongoDB, the schema design is not only a function of the data to be modeled but also of the use case. The schema design is optimized for our most common use case. This has pros and cons – that use case is then typically highly performant; however there is a bias in the schema which may make certain ad hoc queries a little less elegant than in the relational schema.
As we design the schema, the questions we must answer are:
1. When do we embed data versus linking (see below)? Our decisions here imply the answer to question #2:
2. How many collections do we have, and what are they?
3. When do we need atomic operations? These operations can be performed within the scope of a BSON document, but not across documents.
4. What indexes will we create to make query and updates fast?
5. How will we shard? What is the shard key?
Embedding and Linking
A key question when designing a MongoDB schema is when to embed and when to link. Embedding is the nesting of objects and arrays inside a BSON document. Links are references between documents.
There are no joins in MongoDB – distributed joins would be difficult on a 1,000 server cluster. Embedding is a bit like "prejoined" data. Operations within a document are easy for the server to handle; these operations can be fairly rich. Links in contrast must be processed client-side by the application; the application does this by issuing a follow-up query.
Generally, for "contains" relationships between entities, embedding should be be chosen. Use linking when not using linking would result in duplication of data.
Collections
Collections in MongoDB are analogous to tables in a relational database. Each collection contains documents. As mentioned above these documents can be fairly rich.
There is no explicit declaration of the fields within the documents of a collection. However there is a conceptual design from the schema designer of what those fields will be and how the documents in the collection will be structured. MongoDB does not require documents in a collection to have the same structure. However, in practice, most collections are highly homogenous. We can move away from this when we need to though; for example when adding a new field. In a situation such as that, an "alter table" style operation is not necessary.
Atomic Operations
Some problems require the ability to perform atomic operations. For example, simply incrementing a counter is often a case where one wants atomicity. MongoDB can also perform more complex operations such as that shown in the pseudocode below:
atomically {
if( doc.credits > 5 ) {
doc.credits -= 5;
doc.debits += 5;
}
}
Another example would be a user registration scenario. We would never want to users to register the same username simultaneously:
atomically {
if( exists a document with username='jane' ) {
print "username already in use please choose another";
} else {
insert a document with username='jane' in the users collection;
print("thanks you have registered as user jane.");
}
}
The key aspect here in terms of schema design is that our scope of atomicity / ACID properties is the document. Thus we need to assure that all fields relevant to the atomic operation are in the same document.
Indexes
MongoDB supports the declaration of indexes. Indexes in MongoDB are highly analogous to indexes in a relational database : they are needed for efficient query processing, and must be explicitly declared. Thus we need to think about what indexes we will define as part of the schema design process. Just like in a relational database, indexes can be added later – if we decide later to have a new one, we can do that.
Sharding
Another consideration for schema design is sharding. A BSON document (which may have significant amounts of embedding) resides on one and only one shard.
A collection may be sharded. When sharded, the collection has a shard key, which determines how the collection is partitioned among shards. Typically (but not always) queries on a sharded collection involve the shard key as part of the query expression.
The key here is that changing shard keys is difficult. You will want to choose the right key from the start.
Example
Let's consider an example, which is a content management system. The examples below use mongo shell syntax but could also be done in any programming language – just use the appropriate driver for that language.
Our content management system will have posts. Post have authors. We'd like to support commenting and voting on posts. We'd also like posts to be taggable for searching.
One good schema design for this scenario would be to have two MongoDB collections, one called posts and one called users. This is what we will use for the example.
Our users have a few properties - a user id they registered with, their real name, and their karma. For example we could invoke:
> db.users.insert( { _id : "alex", name: { first:"Alex", last:"Benisson" }, karma : 1.0 } )
The _id field is always present in MongoDB, and is automically indexed with a unique key constraint. That is perfect for our usernames so we store them in the _id field. We don't have to though; we could instead make a username field and let the system automically generate a unique id.
Let's now consider posts. We'll assume some posts are already populated. Let's query one:
> db.posts.findOne()
{
_id : ObjectId("4e77bb3b8a3e000000004f7a"),
when : Date("2011-09-19T02:10:11.3Z",
author : "alex",
title : "No Free Lunch",
text : "This is the text of the post. It could be very long.",
tags : [ "business", "ramblings" ],
votes : 5,
voters : [ "jane", "joe", "spencer", "phyllis", "li" ],
comments : [
{ who : "jane", when : Date("2011-09-19T04:00:10.112Z"),
comment : "I agree." },
{ who : "meghan", when : Date("2011-09-20T14:36:06.958Z"),
comment : "You must be joking. etc etc ..." }
]
}
It's interesting to contrast this with how we might design the same schema in a relation database. We would likely have a users collection and a posts collection. But in addition one would typically have a tags collection, a voters collection, and a comments collection. Grabbing all the information on a single post would then be a little involved. Here to grab the whole post we might execute:
> db.posts.findOne( { _id : ObjectId("4e77bb3b8a3e000000004f7a") } );
To get all posts written by alex:
> db.posts.find( { author : "alex" } )
If the above is a common query we would create an index on the author field:
> db.posts.ensureIndex( { author : 1 } )
The post documents can be fairly large. To get just the titles of the posts by alex :
> db.posts.find( { author : "alex" }, { title : 1 } )
We may want to search for posts by tag:
> // make and index of all tags so that the query is fast:
> db.posts.ensureIndex( { tags : 1 } )
> db.posts.find( { tags : "business" } )
What if we want to find all posts commented on by meghan?
> db.posts.find( { comments.who : "meghan" } )
Let's index that to make it fast:
> db.posts.ensureIndex( { "comments.who" : 1 } )
We track voters above so that no one can vote more than once. Suppose calvin wants to vote for the example post above. The following update operation will record calvin's vote. Because of the $nin sub-expression, if calvin has already voted, the update will have no effect.
> db.posts.update( { _id : ObjectId("4e77bb3b8a3e000000004f7a"),
voters : { $nin : "calvin" } },
{ votes : { $inc : 1 }, voters : { $push : "calvin" } );
Note the above operation is atomic : if multiple users vote simultaneously, no votes would be lost.
Suppose we want to display the title of the latest post in the system as well as the full user name for the author. This is a case where we must use client-side linking:
> var post = db.posts.find().sort( { when : -1 } ).limit(1);
> var user = db.users.find( { _id : post.author } );
> print( post.title + " " + user.name.first + " " + user.name.last );
A final question we might ask about our example is how we would shard. If the users collection is small, we would not need to shard it at all. If posts is huge, we would shard it. We would need to choose a shard key. The key should be chosen based on the queries that will be common. We want those queries to involve the shard key.
Sharding by _id is one option here.
If finding the most recent posts is a very frequent query, we would then shard on the when field. (There is also an optimization trick which might work here.)
Summary of Best Practices
- "First class" objects, that are at top level, typically have their own collection.
- Line item detail objects typically are embedded.
- Objects which follow an object modelling "contains" relationship should generally be embedded.
- Many to many relationships are generally done by linking.
- Collections with only a few objects may safely exist as separate collections, as the whole collection is quickly cached in application server memory.
- Embedded objects are a bit harder to link to than "top level" objects in collections.
- It is more difficult to get a system-level view for embedded objects. When needed an operation of this sort is performed by using MongoDB's map/reduce facility.
- If the amount of data to embed is huge (many megabytes), you may reach the limit on size of a single object. See also GridFS.
- If performance is an issue, embed.
More Details
Choosing Indexes
A second aspect of schema design is index selection. As a general rule, where you want an index in a relational database, you want an index in Mongo.
The _id field is automatically indexed.
Fields upon which keys are looked up should be indexed.
Sort fields generally should be indexed.
The MongoDB profiling facility provides useful information for where an index should be added that is missing.
Note that adding an index slows writes to a collection, but not reads. Use lots of indexes for collections with a high read : write ratio (assuming one does not mind the storage overage). For collections with more writes than reads, indexes are expensive as keys must be added to each index for each insert.
How Many Collections?
As Mongo collections are polymorphic, one could have a collection objects and put everything in it! This approach is taken by some object databases. This is not recommended in MongoDB for several reasons, mainly performance. Data within a single collection is roughly contiguous on disk. Thus, "table scans" of a collection are possible, and efficient. Just like in relational dbs, independent collections are very important for high throughput batch processing.
分享到:
相关推荐
The Little MongoDB Schema Design Book, covers the fundamentals off Schema design with MongoDB, as well as several useful Schema design patters for your applications. I wrote this book to be a helpful...
构建MongoDB Schema Simulator Tool的目的是允许模拟《 The Little MongoDB Schema Design Book概述的The Little MongoDB Schema Design Book 。 链接 安装工具 安装该工具非常简单 npm install -g mongodb-schema-...
从给定的文件信息中,我们可以提炼出关于MongoDB架构设计基础的重要知识点,涉及数据建模的历史、目的、以及在MongoDB中实现这些设计的具体方法。 ### 数据建模简史 数据建模的历史源远流长,从最早的ISAM...
### MongoDB Meetup Schema设计 #### 一、简介与原则 本次演讲由10gen的工程师Kyle Banker在2010年9月21日的MongoNYC会议上进行。主要内容涵盖了MongoDB的基本介绍、设计原则以及如何在MongoDB中进行有效的模式...
Focus on data usage and better design schemas with the help of MongoDB About This Book Create reliable, scalable data models with MongoDB Optimize the schema design process to support applications of...
Application developers love MongoDB, a document-... A reference section on schema design patterns helps ease the transition from the relational data model of SQL to MongoDB’s document-based data model.
"mongodb-schema-design"项目专注于JSON模式定义,为构建一个灵活的广告分类引擎提供了数据模型的指导。 JSON(JavaScript Object Notation)模式定义在MongoDB中起到了类似传统关系数据库中的表结构的作用,但它...
Application Design Tips: What to keep in mind when designing your schema Implementation Tips: Programming applications against MongoDB Optimization Tips: Speeding up your applications Data Safety Tips...
4. **Mongoose**:在Node.js中,Mongoose是与MongoDB交互的库,提供了模型(Model)和操作(Schema)的概念,方便数据验证和操作,使数据库操作更加简洁。 5. **路由(Routing)**:在Express中,路由用于定义HTTP...
这个项目是无聊的时候做的,主要客户端为移动端,利用唤醒URL schema原理, 可以将自己喜欢的资源通过百度云分享上传到平台 利用平台的共享性 或许自己的所需资源,相当于一个百度云盘的集合,整体项目分三部分 采用...
在SafeAuth后端,Mongoose为我们定义了用户模型(User Schema),方便地进行数据操作,如创建新用户、查询用户信息等。 ### 六、环境配置与安全实践 在后端,`.env`文件用于存储敏感信息,如数据库连接字符串,...
- Enhancing user interaction through responsive design. 5. **Advanced Topics and Best Practices**: - Deploying Flask applications to production environments. - Debugging and testing strategies. -...