The high-level organization of Pregel programs is inspired by Valiant’s Bulk Synchronous Parallel model . Pregel
computations consist of a sequence of iterations, called supersteps. During a superstep the framework invokes a user-defined function for each vertex, conceptually in parallel.The function specifies behavior at a single vertex V and a single superstep S. It can read messages sent to V in superstep S − 1, send messages to other vertices that will be received at superstep S + 1, and modify the state of V and its outgoing edges. Messages are typically sent along outgoing edges, but a message may be sent to any vertex whose identifier is known.
The input to a Pregel computation is a directed graph in which each vertex is uniquely identified by a string vertex
identifier. Each vertex is associated with a modifiable, user defined value. The directed edges are associated with their source vertices, and each edge consists of a modifiable, user defined value and a target vertex identifier.
A typical Pregel computation consists of input, when the graph is initialized, followed by a sequence of supersteps separated by global synchronization points until the algorithm terminates, and finishing with output.
Within each superstep the vertices compute in parallel,each executing the same user-defined function that expresses the logic of a given algorithm. A vertex can modify its state or that of its outgoing edges, receive messages sent to it in the previous superstep, send messages to other vertices (to be received in the next superstep), or even mutate the topology of the graph. Edges are not first-class citizens in this model, having no associated computation.
Algorithm termination is based on every vertex voting to halt. In superstep 0, every vertex is in the active state; all active vertices participate in the computation of any given superstep. A vertex deactivates itself by voting to halt. This means that the vertex has no further work to do unless triggered externally, and the Pregel framework will not execute that vertex in subsequent supersteps unless it receives a message. If reactivated by a message, a vertex must explicitly deactivate itself again. The algorithm as a whole terminates when all vertices are simultaneously inactive and there are no messages in transit. This simple state machine is illustrated in Figure 1.
The output of a Pregel program is the set of values explicitly output by the vertices. It is often a directed graph
isomorphic to the input, but this is not a necessary property of the system because vertices and edges can be added and removed during computation. A clustering algorithm,for example, might generate a small set of disconnected vertices selected from a large graph. A graph mining algorithm might simply output aggregated statistics mined from the graph.
Figure 2 illustrates these concepts using a simple example:given a strongly connected graph where each vertex contains a value, it propagates the largest value to every vertex. In each superstep, any vertex that has learned a larger value from its messages sends it to all its neighbors. When no further vertices change in a superstep, the algorithm terminates.
We chose a pure message passing model, omitting remote reads and other ways of emulating shared memory, for two reasons. First, message passing is sufficiently expressive that there is no need for remote reads. We have not found any graph algorithms for which message passing is insufficient.Second, this choice is better for performance. In a cluster environment, reading a value from a remote machine incurs high latency that can’t easily be hidden. Our message passing model allows us to amortize latency by delivering messages asynchronously in batches.
Graph algorithms can be written as a series of chained MapReduce invocations [11, 30]. We chose a different model for reasons of usability and performance. Pregel keeps vertices and edges on the machine that performs computation,and uses network transfers only for messages. MapReduce,however, is essentially functional, so expressing a graph algorithm as a chained MapReduce requires passing the entire state of the graph from one stage to the next—in general requiring much more communication and associated serialization overhead. In addition, the need to coordinate the steps of a chained MapReduce adds programming complexity that is avoided by Pregel’s iteration over supersteps.
The orignal paper see:
Grzegorz Malewicz, Matthew H. Austern, 《Pregel: A System for Large-Scale Graph Processing》
相关推荐
1. **图(Graph)**: 图是由节点(Vertices)和边(Edges)组成的集合,节点表示实体,边表示实体之间的关系。 2. **顶点(Vertices)**: 顶点是图中的基本单元,可以代表任何对象,如用户、网页等。 3. **边(Edges...
本文将详细介绍如何在 Spark 上实现大规模图处理,特别是通过 Bagel 实现类似 Google Pregel 的编程模型。 #### 动机 Google 的 Pregel 是一种非常方便的图计算模型,它为解决大规模图问题提供了强大的工具。然而...
6个pdf,Google官方发布的。 [1]Bigtable: A Distributed Storage System for Structured Data [2]MapReduce: Simplified Data Processing on Large Clusters ...[6]Pregel: A System for Large-Scale Graph Processing
这个框架的设计灵感来源于Google的两篇著名论文——"MapReduce: Simplified Data Processing on Large Clusters"和"Pregel: A System for Large-Scale Graph Processing"。`Mapper`和`Reducer`是MapReduce模型中的两...
7. **Pregel: A System for Large-Scale Graph Processing**:Pregel 是谷歌研发的图形处理系统,适用于大规模图数据的处理。它基于顶点为中心的思想,通过迭代的方式进行图计算,有效提升了图形处理效率。 8. **...
Apache Hadoop---Giraph是Google的Pregel系统开源实现,专为大规模图的分布式计算设计。Giraph基于Hadoop构建,继承了Pregel的计算模型,并增加了如out-of-core computation(离核心计算)和edge-oriented input...
Colossus Papers: spanner, Pregel, Dremel, Caffeine. A second generation of google file system and large-scale distributed computing patforms and database
3. **《Pregel - A System for Large-Scale Graph Processing》**:Pregel是一种处理大规模图数据的计算框架,适用于社交网络分析、推荐系统等应用。它提供了一种迭代模型,允许节点间通信,并支持并行和分布式执行...
使用Pregel和PageRank算法进行图分析已实施的操作基于图度的社交图中大多数连接的用户。 基于单用户分离度。 输入是用户的ID-输出是具有用户的元组列表以及它们之间的分隔度。 两个定义的用户之间的隔离度(作为单个...
2010年,谷歌公司推出了名为Pregel的系统,为在商品集群和云环境中进行图形处理开启了一个新时代。在Pregel之后,多种具有不同编程模型和特点的分布式图形处理框架相继出现。 文章进一步提出,为了更好地理解现有的...
程序构建:~/reef-pregel$ mvn 全新安装程序执行: /reef-pregel/bin$ ./run.sh -input file:/// /reef-pregel/bin/数据集(如果您在 hadoop 上运行此程序,请使用此选项:-local false) 如何验证输出:$ grep -r "R...
### Pregel:大规模图处理系统 #### 一、引言与背景 随着社交网络、Web 图等现代系统的兴起,大型图数据集成为了计算任务的重要组成部分。这些数据集包含了大量的节点(顶点)和边,涉及到诸如最短路径、聚类分析...
标题"Challenges & Opportunities in Graph Processing at Alibaba"揭示了在阿里巴巴进行图处理时所面临的难题以及潜在的发展机遇。 首先,让我们深入探讨图处理的挑战: 1. **大数据量**:阿里巴巴拥有海量用户和...
6. **shagunsodhani-pregel-5ad1abe**:这个文件可能是GitHub上的一个代码仓库,包含了作者shagunsodhani对Pregel和GCN实现的源代码。通过查看和学习这个代码,你可以了解具体的实现细节,包括文件组织、函数定义、...
GraphX的内部架构基于Pregel模型,同时结合了GraphLab的一些设计理念。 - **3个核心的RDD**: - Vertices:表示图中的节点集合。 - Edges:表示图中的边集合。 - Triplets:由源节点、目标节点和边组成的三元组...
DMID 在信息系统亚琛工业大学(RWTH Aachen University)主席的学士论文“ Pregel:重叠社区检测算法的并行实现”中,实现了针对giraph的重叠社区检测算法DMID的实现。 ## SETUP在下面,我们将描述如何在Ubuntu 64位...
- **生态系统丰富**:除了核心的Spark框架之外,还有一系列围绕Spark构建的工具和服务,如Spark SQL、Spark Streaming、MLlib、GraphX等,这些组件共同构成了一个完整的大数据分析平台。 #### 七、结论 Apache ...
综上所述,通过对Spark GraphX的深入源码分析,可以看出GraphX是如何结合Pregel和GraphLab的优点,将图计算与Spark的大数据处理能力相结合的。同时,GraphX通过提供各种图运算操作、优化存储结构和利用Pregel API等...
为了克服这些挑战,阿里巴巴可能正在探索分布式图处理框架,如Pregel、GraphX或JanusGraph,以实现水平扩展和并行计算。同时,研究和开发新的图算法以减少计算复杂性,提高查询效率,也是关键。此外,结合硬件加速...