`
mryufeng
  • 浏览: 982286 次
  • 性别: Icon_minigender_1
  • 来自: 广州
社区版块
存档分类
最新评论

mnesia实现概述

阅读更多
不小心在 lib/mnesia/doc/misc/implementation.txt中找到的, 对于深入理解mnesia实现,阅读源码是很好的参考资料。

Mnesia

1 Introduction

This document aims to give a brief introduction of the implementation
of mnesia, it's data and functions.

Håkan has written other mnesia papers of interest, (see ~hakan/public_html/):
o Resource consumption (mnesia_consumption.txt)
o What to think about when changing mnesia (mnesia_upgrade_policy.txt)
o Mnesia internals course (mnesia_internals_slides.pdf)
o Mnesia overview (mnesia_overview.pdf)

1.1. Basic concepts

In a mnesia cluster all nodes are equal, there is no concept off
master or backup nodes. That said when mixing disc based (uses the
disc to store meta information) nodes and ram based (do not use disc
at all) nodes the disc based ones sometimes have precedence over ram
based nodes.

2 Meta Data

Mnesia has two types of global meta data, static and dynamic.
All the meta data is stored in the ets table mnesia_gvar.

2.1 Static Meta Data
The static data is the schema information, usually kept in
'schema.DAT' file, the data is created with
mnesia:create_schema(Nodes) for disc nodes (i.e. nodes which uses the
disc). Ram based mnesia nodes create an empty schema at startup.

The static data i.e. schema, contains information about which nodes
are involved in the cluster and which type (ram or disc) they have. It
also contains information about which tables exist on which node and
so on.

The schema information (static data) must always be the same on all
active nodes in the mnesia cluster. Schema information is updated via
schema functions, e.g. mnesia:add_table_copy/3,
mnesia:change_table_copy/3...

2.2 Dynamic Meta Data

The dynamic data is transient and is local to each mnesia node
in the cluster. Examples of dynamic data is: currently active mnesia
nodes, which tables are currently available and where are they
located. Dynamic data is updated internally by each mnesia during the
nodes lifetime, i.e. when nodes goes up and down or are added to or
deleted from the mnesia cluster.

3 Processes and Files

The most important processes in mnesia are mnesia_monitor,
mnesia_controller, mnesia_tm and mnesia_locker.

Mnesia_monitor acts as supervisor and monitors all resources.  It
listens for nodeup and nodedown and keeps links to other mnesia nodes,
if a node goes down it forwards the information to all the necessary
processes, e.g. mnesia_controller, mnesia_locker, mnesia_tm and all
transactions.  During start it negotiates the protocol version with
the other nodes and keep track of which nodes uses which version. The
monitor process also detects and warns about partioned networks, it is
then up to the user to deal with them. It is the owner of all open
files, ets tables and so on.

The mnesia_controller process is responsible for loading tables,
keeping the dynamic meta data updated, synchronize dangerous work such
as schema transactions vs dump log operation vs table loading/sending.

The last two processes are involved in all transactions, the
mnesia_locker process manages transaction locks, and mnesia_tm manages
all transaction work.

4 Startup and table Loading

The early startup is mostly driven by the mnesia_tm process/module,
logs are dumped (see log dumping), node-names of other nodes in the
cluster are retrieved from the static meta data or from environment
parameters and initial connections are made to the other mnesia
nodes.

The rest of start up is driven by the mnesia_controller process where
the schema (static meta data) is merged between each node, this is
done to keep the schema consistent between all nodes in the
cluster. When the schema is merged all local tables are put in a
loading queue, tables which are only available or have local content
is loaded directly from disc or created if they are type ram_copies.

The other tables are kept in the queue until mnesia decides whether to
load them from disk or from another node.  If another mnesia node has
already loaded the table, i.e. got a copy in ram or an open dets file,
the table is always loaded from that node to keep the data consistent.
If no other node has a loaded copy of the table, some mnesia node has
to load it first, and the other nodes can copy the table from the
first node. Mnesia keeps information about when other nodes went down,
a starting mnesia will check which nodes have been down, if some of
the nodes have not been down the starting node will let those nodes
load the table first. If all other nodes have been down then the
starting mnesia will load the table. The node that is allowed to load
the table will load it and the other nodes will copy it from that node.

If a node, which the starter node has not a 'mnesia_down' note from,
is down the starter node will have to wait until that node comes up
and decision can be taken, this behavior can be overruled by user
settings. The order of table loading could be described as:

1. Mnesia downs, Normally decides from where mnesia should load tables.
2. Master nodes (overrides mnesia downs).
3. Force load (overrides Master nodes).
   1) If possible, load table from active master nodes
   2) if no master nodes is active load from any active nodes,
   3) if no active node has an active table get local copy
      (if ram create empty one)

Currently mnesia can handle one download and one upload at the same
time. Dumping and loading/sending may run simultaneously but neither
of them may run during schema commit. Loaders/senders may not start if
a schema commit is enqueued. That synchronization is made to prohibit
that the schema transaction modifies the meta data and the
prerequisites of the table loading changes.

The actual loading of a table is implemented in 'mnesia_loader.erl'.
It currently works as follows:

Receiver Sender
-------- ------
Spawned
Find sender node
Queue sender request    ---->                
Spawned
*)Spawn real receiver <---- Send init info
*)Grab schema lock for Grab write table lock
  that table to avoid Subscribe receiver
  deadlock with schema transactions    to table updates
Create table (ets or dets) Release Lock

Get data (with ack ---->           
   as flow control)     <---- Burst data to receiver
Send no_more
Apply subscription messages
Store copy on disc Grab read lock
Create index, snmp data Update meta data info
and checkpoints if needed cleanup
no_more ---->
Release lock

*) Don't spawn or grab schema lock if operation is add_table_copy,
   it's already a schema operation.


5 Transaction

Transaction are normally driven from the client process, i.e. the
process that call 'mnesia:transaction'. The client first acquires a
globally unique transaction id (tid) and temporary transaction storage
(ts an ets table) from mnesia_tm and then executes the transaction
fun. Mnesia-api calls such as 'mnesia:write/1' and 'mnesia:read'
contains code for acquiring the needed locks. Intermediate database
states and acquired locks are kept in the transaction storage, and all
mnesia operations has to be "patched" against that store. I.e. a write
operation in a transaction should be seen within (and only within)
that transaction, if the same key is read after the write.
After the transaction fun is completed the ts is analyzed to see which
nodes are involved in the transaction, and what type of commit protocol
shall be used. Then the result is committed and additional work such as
snmp, checkpoints and index updates are performed. The transaction is
finish by releasing all resources.

An example:

Example = fun(X) ->
      {table1, key, Value} = mnesia:read(table1, key),
      ok = mnesia:write(table1, {table1, key, Value+X}),
      {table1, key, Updated} = mnesia:read(table1, key),
      Updated
  end,
mnesia:transaction(Example, [10]).

A message overview of a simple successful asynchronous transaction
                                                               non local
Client Process     mnesia_tm(local) mnesia_locker  mnesia_tm
------------------------------------------------------------------------
Get tid          ---->      
<---       Tid and ts
Get read lock
from available node  ------------------------------->
Value          <----------Value or restart trans---
Patch value against ts

Get write lock
from all nodes     ------------------------------->
                   ------------------------------->
ok's           <<---------ok's or restart trans---
write data in ts

Get read lock,already done.
Read data Value
'Patch' data with ts
Fun return Value+X.

If everything is ok
commit transaction

Find the nodes that the transaction
needs to be committed on and
collect every update from ts.

Ask for commit  ----------->
                ----------------------------------------------->

Ok's <<---------      ------------------------------
Commit          ----------------------------------------------->
log commit decision on disk
Commit locally
  update snmp
  update checkpoints
  notify subscribers
  update index
 
Release locks   ------------------------------->
Release transaction  ----->

Return trans result
----------------------

If all needed resources are available, i.e. the needed tables are
loaded somewhere in the cluster during the transaction, and the user
code doesn't crash, a transaction in mnesia won't fail. If something
happens in the mnesia cluster such as node down from the replica the
transaction was about to read from, or that a lock couldn't be
acquired and the transaction was not allowed to be queued on that
lock, the transaction is restarted, i.e. all resources are released
and the fun is called again. By default a transaction can be
restarted is infinity many times, but the user may choose to limit
the number of restarts.

The dirty operations don't do any of the above they just finds out
where to write the data, logs the operation to disk and casts (or call
in case of sync_dirty operation) the data to those nodes.  Therefore
the dirty operations have the drawback that each write or delete sends
a message per operation to the involved nodes.

There is also a synchronous variant of 2-phase commit protocol which
waits on an additional ack message after the transaction is committed
on every node. The intention is to provide the user with a way to
solve overloading problems.

A 3-phase commit protocol is used for schema transaction or if the
transaction result is going to be committed in a asymmetrical way,
i.e. a transaction that writes to table a and b where table a and b
have replicas on different nodes. The outcome of the transactions are
stored temporary in an ets table and in the log file.

6 Schema transactions

Schema transactions are handled differently than ordinary
transactions, they are implemented in mnesia_schema (and in
mnesia_dumper). The schema operation is always spawned to protect from
that the client process dies during the transaction.

The actual transaction fun checks the pre-conditions and acquires the
needed locks and notes the operation in the transaction store. During
the commit, the schema transaction runs a schema prepare operation (on
every node) that does the needed prerequisite job. Then the operation
is logged to disc, and the actual commit work is done by dumping the
log. Every schema operation has special clause in mnesia_dumper to
handle the finishing work. Every schema prepare operation has a
matching undo_prepare operation which needs to be invoked if the
transaction is aborted.

7 Locks

"The locking algorithm is a traditional 'two-phase locking'* and the
deadlock prevention is 'wait-die'*, time stamps for the wait-die algorithm
is 'Lamport clock'* maintained by mnesia_tm. The Lamport clock is kept
when the transaction is restarted to avoid starving."

* References can be found in the paper mnesia_overview.pdf
  Klacke, Håkan and Hans wrote about mnesia.

What the quote above means is that read locks are acquired on the
replica that mnesia read from, write locks are acquired on all nodes
which have a replica. Several read lock can lock the same object, but
write locks are exclusive. The transaction identifier (tid) is a ever
increasing system uniq counter which have the same sort order on every
node (a Lamport clock), which enables mnesia_locker to order the lock
requests. When a lock request arrives, mnesia_locker checks whether
the lock is available, if it is a 'granted' is sent back to the client
and the lock is noted as taken in an ets table. If the lock is already
occupied, it's tid is compared with tid of the transaction holding the
lock. If the tid of holding transaction is greater than the tid of
asking transaction it's allowed to be put in the lock queue (another
ets table) and no response is sent back until the lock is released, if
not the transaction will get a negative response and mnesia_tm will
restart the transaction after it has slept for a random time.

Sticky locks works almost as a write lock, the first time a sticky
lock is acquired a request is sent to all nodes. The lock is marked as
taken by the requesting node (not transaction), when the lock is later
released it's only released on the node that has the sticky lock,
thus the next time a transaction is requesting the lock it don't need
to ask the others nodes. If another node wants the lock it has to request
a lock release first, before it can acquire the lock.

8 Fragmented tables

Fragmented tables are used to split a large table in smaller parts.
It is implemented as a layer between the client and mnesia which
extends the meta data with additional properties and maps a {table,
key} tuple to a table_fragment.

The default mapping is erlang:phash() but the user may provide his own
mapping function to be able to predict which records is stored in
which table fragment, e.g.  the client may want to steer where a
record generated from a certain device is placed.

The foreign key is used to co-locate other tables to the same node.
The other additinal table attributes are also used to distribute the
table fragments.

9 Log Dumping

All operations on disk tables are stored on a log 'LATEST.LOG' on
disk, so mnesia can redo the transactions if the node goes down.
Dumping the log means that mnesia moves the committed data from the
general log to the table specific disk storage. To avoid that the log
grows to large and uses a lot of disk space and makes the startup slow,
mnesia dumps the log during it's uptime. There are two triggers that
start the log dumping, timeouts and the number of commits since last
dump, both are user configurable.

Disc copies tables are implemented with two disk_log files, one
'table.DCD' (disc copies data) and one 'table.DCL' (disc copies log).
The dcd contains raw records, and the dcl contains operations on that
table, i.e. '{write, {table, key, value}}' or '{delete, {table,
key}}'.  First time a record for a specific table is found when
dumping the table, the size of both the dcd and the dcl files are
checked. And if the sizeof(dcl)/sizeof(dcd) is greater than a
threshold, the current ram table is dumped to file 'table.DCD' and the
corresponding dcl file is deleted, and all other records in the
general log that belongs to that table are ignored. If the threshold
is not meet than the operations in the general log to that table are
appended to the dcl file. On start up both files are read, first the
contents of the dcd are loaded to an ets table, then it's modified by
the operations stored in the corresponding dcl file.

Disc only copies tables updates the 'dets' file directly when
committing the data so those entries can be ignored during normal log
dumping, they are only added to the 'dets' file during startup when
mnesia don't know the state of the disk table.

10 Checkpoints and backups

Checkpoints are created to be able to take snapshots of the database,
which is pretty good when you want consistent backups, i.e. you don't
want half of a transaction in the backup. The checkpoint creates a
shadow table (called retainer) for each table involved in the
checkpoint. When a checkpoint is requested it will not start until all
ongoing transactions are completed. The new transactions will update
both the real table and update the shadow table with operations to
undo the changes on the real table, when a key is modified the first
time.  I.e. when write operation '{table, a, 14}' is made, the shadow
table is checked if key 'a' has a undo operation, if it has, nothing
more is done. If not a {write, {table, a, OLD_VALUE}} is added to the
shadow table if the real table had an old value, if not a {delete,
{table, a}} operation is added to the shadow table.

The backup is taken by copying every record in the real table and then
appending every operation in the shadow table to the backup, thus
undoing the changes that where made since the checkpoint where
started.


分享到:
评论
6 楼 mryufeng 2009-10-19  
whrllm 写道
老大啥时候也把文中提到的这些文档也挖出来呀,呵呵 期待呀...
o Resource consumption (mnesia_consumption.txt)
o What to think about when changing mnesia (mnesia_upgrade_policy.txt)
o Mnesia internals course (mnesia_internals_slides.pdf)
o Mnesia overview (mnesia_overview.pdf)


上面的几个材料 都能google的到的
5 楼 whrllm 2009-10-19  
老大啥时候也把文中提到的这些文档也挖出来呀,呵呵 期待呀...
o Resource consumption (mnesia_consumption.txt)
o What to think about when changing mnesia (mnesia_upgrade_policy.txt)
o Mnesia internals course (mnesia_internals_slides.pdf)
o Mnesia overview (mnesia_overview.pdf)
4 楼 litaocheng 2009-09-23  
真是细致入微啊...
3 楼 mryufeng 2009-09-16  
R13B02对mnesia的梳理很大 觉得是时候把这个东西做的更好了。
2 楼 bachmozart 2009-09-16  
期待老大以后每天都不小心一下
1 楼 dennis_zane 2009-09-16  
多谢老大分享。

相关推荐

    Mnesia User's Guide

    session, specify a Mnesia database directory, initialize a database schema, start Mnesia, and create tables. Initial prototyping of record definitions is also discussed. • Build a Mnesia Database ...

    Api-Social-Amnesia.zip

    Api-Social-Amnesia.zip,忘记过去。社交健忘症确保你的社交媒体帐户只显示你最近的历史,而不是5年前“那个阶段”的帖子。,一个api可以被认为是多个软件设备之间通信的指导手册。例如,api可用于web应用程序之间的...

    amnesia:失忆备忘录

    B站视频地址: 做了文字校验,已经成功上线,有兴趣的小伙伴可以扫码体验:可以微信搜索:失忆备忘录一、失忆的由来之所以开发这款软件,是因为在那段时间事情很多,但是经常忘记。虽然市面上类似的功能很多,我之前...

    amnesia-开源

    AMNESIA是一个基于Erlang编程语言的开源库,专门设计用于简化与关系数据库管理系统(RDBMS)的交互。Erlang以其并发处理、容错性和高效性能在分布式系统领域备受推崇,而AMNESIA则进一步扩展了Erlang的功能,使...

    Chrome Amnesia-crx插件

    语言:English (United States) 遗忘的延伸 Chrome失忆症是一个Chrome扩展程序,可让您有选择地不记得自己的任何浏览历史记录。...有关更多信息,请访问https://github.com/DanielBok/chrome-amnesia。

    Amnesia-开源

    失忆症是一种提醒,允许您定义警报,贴纸(贴子)以提醒您一些重要的内容以及有关所需内容的注释。 可以将警报编程为在给定时间显示,可以在桌面上放置贴纸以随时查看。

    Mnesia用户手册.zip

    《Mnesia用户手册》是专为理解和操作Erlang编程语言中的Mnesia数据库管理系统而编写的详尽指南。Mnesia是Erlang OTP (Open Telephony Platform) 库中的一个核心组件,它是一个强大的分布式数据库系统,特别适用于...

    Amnesia

    "Amnesia"是一个可能与计算机安全或数据丢失相关的主题,暗示了系统或用户可能遭遇了某种形式的记忆丧失,即数据无法访问或丢失的情况。在IT领域,这种情况通常涉及到磁盘故障、病毒攻击、误操作或者软件错误。"Post...

    erlang——Mnesia用户手册.pdf

    8.附录.A:Mnesia.错误信息 8.1.Mnesia.中的错误 9.附录.B:备份回调函数接口 9.1.Mnesia.备份回调行为 10.附录.C:作业存取回调接口 10.1.Mnnesia.存取回调行为 11.附录.D:分片表哈希回调接口 11.1....

    mnesia数据库文档

    ### Mnesia数据库:Erlang中的分布式数据库管理系统 #### 引言 Mnesia,作为Erlang编程语言的一部分,是一款由爱立信公司开发的分布式数据库管理系统(DBMS)。自1997年以来,Mnesia一直是Open Telecom Platform...

    amnesia_cck:“ amensia”项目的源代码,已针对CCK展示进行了修改-Show source code

    "Amnesia_CCK"是一个基于开源系统开发的项目,它主要关注的是"失忆症"游戏的源代码,经过特定的调整和优化,以便在CCK(Content Creation Kit)平台上进行展示。CCK通常是一个工具集,允许用户创建、修改和扩展游戏...

    Mnesia table fragmentation 过程及算法分析

    在 Erlang 中实现 Linear Hashing 需要考虑几个方面。首先,Hash 表的效率与总元素数和 Bucket 数的比例(N/B)有关。比例越高,查找效率就越低。为了保持 Hash 表的性能,需要控制 Bucket 的数量。Bucket 数量可以...

    Mnesia用户手册 4.4.10版.rar

    8 附录 A : Mnesia 错误信息 . . .. . . 75 8.1 Mnesia 中的错误 . . . . .. . 75 9 附录 B :备份回调函数接口 . . .. . .. . . .. . 76 9.1 Mnesia 备份回调行为 . . .. . . . .. . 76 10 附录 C :作业存取...

Global site tag (gtag.js) - Google Analytics