`

Google Architecture

阅读更多

Google Architecture

Update 2: Sorting 1 PB with MapReduce . PB is not peanut-butter-and-jelly misspelled. It's 1 petabyte or 1000 terabytes or 1,000,000 gigabytes. It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers and the results were replicated thrice on 48,000 disks.
Update: Greg Linden points to a new Google article MapReduce: simplified data processing on large clusters . Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory.

Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet scale applications at an alarmingly high competition crushing rate. Their goal is always to build a higher performing higher scaling infrastructure to support their products. How do they do that?

Information Sources

  • Video: Building Large Systems at Google
  • Google Lab: The Google File System
  • Google Lab: MapReduce: Simplified Data Processing on Large Clusters
  • Google Lab: BigTable .
  • Video: BigTable: A Distributed Structured Storage System .
  • Google Lab: The Chubby Lock Service for Loosely-Coupled Distributed Systems .
  • How Google Works by David Carr in Baseline Magazine.
  • Google Lab: Interpreting the Data: Parallel Analysis with Sawzall .
  • Dare Obasonjo's Notes on the scalability conference.

    Platform

  • Linux
  • A large diversity of languages: Python, Java , C++

    What's Inside?

    The Stats

  • Estimated 450,000 low-cost commodity servers in 2006
  • In 2005 Google indexed 8 billion web pages. By now, who knows?
  • Currently there over 200 GFS clusters at Google. A cluster can have 1000 or even 5000 machines. Pools of tens of thousands of machines retrieve data from GFS clusters that run as large as 5 petabytes of storage. Aggregate read/write throughput can be as high as 40 gigabytes/second across the cluster.
  • Currently there are 6000 MapReduce applications at Google and hundreds of new applications are being written each month.
  • BigTable scales to store billions of URLs, hundreds of terabytes of satellite imagery, and preferences for hundreds of millions of users.

    The Stack

    Google visualizes their infrastructure as a three layer stack:

  • Products: search, advertising, email , maps, video, chat, blogger
  • Distributed Systems Infrastructure: GFS, MapReduce, and BigTable.
  • Computing Platforms: a bunch of machines in a bunch of different data centers
  • Make sure easy for folks in the company to deploy at a low cost.
  • Look at price performance data on a per application basis. Spend more money on hardware to not lose log data, but spend less on other types of data. Having said that, they don't lose data.

    Reliable Storage Mechanism with GFS (Google File System)

  • Reliable scalable storage is a core need of any application. GFS is their core storage platform.
  • Google File System - large distributed log structured file system in which they throw in a lot of data.
  • Why build it instead of using something off the shelf? Because they control everything and it's the platform that distinguishes them from everyone else. They required:
    - high reliability across data centers
    - scalability to thousands of network nodes
    - huge read/write bandwidth requirements
    - support for large blocks of data which are gigabytes in size.
    - efficient distribution of operations across nodes to reduce bottlenecks
  • System has master and chunk servers.
    - Master servers keep metadata on the various data files. Data are stored in the file system in 64MB chunks. Clients talk to the master servers to perform metadata operations on files and to locate the chunk server that contains the needed they need on disk.
    - Chunk servers store the actual data on disk. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes. Once directed by a master server, a client application retrieves files directly from chunk servers.
  • A new application coming on line can use an existing GFS cluster or they can make your own. It would be interesting to understand the provisioning process they use across their data centers.
  • Key is enough infrastructure to make sure people have choices for their application. GFS can be tuned to fit individual application needs.

    Do Something With the Data Using MapReduce

  • Now that you have a good storage system, how do you do anything with so much data? Let's say you have many TBs of data stored across a 1000 machines. Databases don't scale or cost effectively scale to those levels. That's where MapReduce comes in.
  • MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
  • Why use MapReduce?
    - Nice way to partition tasks across lots of machines.
    - Handle machine failure.
    - Works across different application types, like search and ads. Almost every application has map reduce type operations. You can precompute useful data, find word counts, sort TBs of data, etc.
    - Computation can automatically move closer to the IO source.
  • The MapReduce system has three different types of servers.
    - The Master server assigns user tasks to map and reduce servers. It also tracks the state of the tasks.
    - The Map servers accept user input and performs map operations on them. The results are written to intermediate files
    - The Reduce servers accepts intermediate files produced by map servers and performs reduce operation on them.
  • For example, you want to count the number of words in all web pages. You would feed all the pages stored on GFS into MapReduce. This would all be happening on 1000s of machines simultaneously and all the coordination, job scheduling, failure handling, and data transport would be done automatically.
    - The steps look like: GFS -> Map -> Shuffle -> Reduction -> Store Results back into GFS.
    - In MapReduce a map maps one view of data to another, producing a key value pair, which in our example is word and count.
    - Shuffling aggregates key types.
    - The reductions sums up all the key value pairs and produces the final answer.
  • The Google indexing pipeline has about 20 different map reductions. A pipeline looks at data with a whole bunch of records and aggregating keys. A second map-reduce comes a long, takes that result and does something else. And so on.
  • Programs can be very small. As little as 20 to 50 lines of code.
  • One problem is stragglers. A straggler is a computation that is going slower than others which holds up everyone. Stragglers may happen because of slow IO (say a bad controller) or from a temporary CPU spike. The solution is to run multiple of the same computations and when one is done kill all the rest.
  • Data transferred between map and reduce servers is compressed. The idea is that because servers aren't CPU bound it makes sense to spend on data compression and decompression in order to save on bandwidth and I/O.

    Storing Structured Data in BigTable

  • BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
  • BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.
  • It provides lookup mechanism to access structured data by key. GFS stores opaque data and many applications needs has data with structure.
  • Commercial databases simply don't scale to this level and they don't work across 1000s machines.
  • By controlling their own low level storage system Google gets more control and leverage to improve their system. For example, if they want features that make cross data center operations easier, they can build it in.
  • Machines can be added and deleted while the system is running and the whole system just works.
  • Each data item is stored in a cell which can be accessed using a row key, column key, or timestamp.
  • Each row is stored in one or more tablets. A tablet is a sequence of 64KB blocks in a data format called SSTable.
  • BigTable has three different types of servers:
    - The Master servers assign tablets to tablet servers. They track where tablets are located and redistributes tasks as needed.
    - The Tablet servers process read/write requests for tablets. They split tablets when they exceed size limits (usually 100MB - 200MB). When a tablet server fails, then a 100 tablet servers each pickup 1 new tablet and the system recovers.
    - The Lock servers form a distributed lock service. Operations like opening a tablet for writing, Master aribtration, and access control checking require mutual exclusion.
  • A locality group can be used to physically store related bits of data together for better locality of reference.
  • Tablets are cached in RAM as much as possible.

    Hardware

  • When you have a lot of machines how do you build them to be cost efficient and use power efficiently?
  • Use ultra cheap commodity hardware and built software on top to handle their death.
  • A 1,000-fold computer power increase can be had for a 33 times lower cost if you you use a failure-prone infrastructure rather than an infrastructure built on highly reliable components. You must build reliability on top of unreliability for this strategy to work.
  • Linux, in-house rack design, PC class mother boards, low end storage.
  • Price per wattage on performance basis isn't getting better. Have huge power and cooling issues.
  • Use a mix of collocation and their own data centers.

    Misc

  • Push changes out quickly rather than wait for QA.
  • Libraries are the predominant way of building programs.
  • Some are applications are provided as services, like crawling.
  • An infrastructure handles versioning of applications so they can be release without a fear of breaking things.

    Future Directions for Google

  • Support geo-distributed clusters.
  • Create a single global namespace for all data. Currently data is segregated by cluster.
  • More and better automated migration of data and computation.
  • Solve consistency issues that happen when you couple wide area replication with network partitioning (e.g. keeping services up even if a cluster goes offline for maintenance or due to some sort of outage).

    Lessons Learned

  • Infrastructure can be a competitive advantage . It certainly is for Google. They can roll out new internet services faster, cheaper, and at scale at few others can compete with. Many companies take a completely different approach. Many companies treat infrastructure as an expense. Each group will use completely different technologies and their will be little planning and commonality of how to build systems. Google thinks of themselves as a systems engineering company, which is a very refreshing way to look at building software.
  • Spanning multiple data centers is still an unsolved problem . Most websites are in one and at most two data centers. How to fully distribute a website across a set of data centers is, shall we say, tricky.
  • Take a look at Hadoop (product ) if you don't have the time to rebuild all this infrastructure from scratch yourself. Hadoop is an open source implementation of many of the same ideas presented here.
  • An under appreciated advantage of a platform approach is junior developers can quickly and confidently create robust applications on top of the platform. If every project needs to create the same distributed infrastructure wheel you'll run into difficulty because the people who know how to do this are relatively rare.
  • Synergy isn't always crap . By making all parts of a system work together an improvement in one helps them all. Improve the file system and everyone benefits immediately and transparently. If every project uses a different file system then there's no continual incremental improvement across the entire stack.
  • Build self-managing systems that work without having to take the system down . This allows you to more easily rebalance resources across servers, add more capacity dynamically, bring machines off line, and gracefully handle upgrades.
  • Create a Darwinian infrastructure . Perform time consuming operation in parallel and take the winner.
  • Don't ignore the Academy . Academia has a lot of good ideas that don't get translated into production environments. Most of what Google has done has prior art, just not prior large scale deployment .
  • Consider compression . Compression is a good option when you have a lot of CPU to throw around and limited IO.
  • 分享到:
    评论

    相关推荐

      实用架构教程-谷歌架构英文版Google Architecture

      http://highscalability.com/google-architecture http://weibo.com/developerworks 2012-11-11 整理 第 1/9页 Large Clusters 4. Google Lab: BigTable. 5. Video: BigTable: A Distributed Structured Storage ...

      google architecture sample

      "google architecture sample"就是一个这样的项目,它包含了多种不同的应用程序架构模式,以供开发者学习和参考。以下是对这些架构模式的详细说明: 1. **MVP(Model-View-Presenter)**: MVP是一种流行的设计...

      Google Architecture.pdf

      ### Google架构核心知识点 #### 一、概述 Google作为一个全球领先的科技公司,在处理大规模数据集方面有着独到的技术优势。其技术栈的核心是基于一系列专有的分布式计算框架和技术,包括但不限于BigTable、...

      google-analytics-architecture

      google-analytics-architecture

      Android Architecture_googlesamples.zip

      "Android Architecture_googlesamples.zip"是谷歌提供的一组开源项目,旨在帮助开发者理解和实践各种Android应用程序的体系结构工具和模式。这个压缩包中的"architecture-samples-master"目录,包含了多个示例应用,...

      DaggerViewModel:一个集成模块,用于将Google Architecture组件的ViewModel注入Dagger2注入的Android组件

      集成模块,用于将Google Architecture Components的注入注入Android活动和片段。 该库的灵感来自于官方的示例。 安装 在Android build.gradle文件的“ dependencies部分中添加以下行之一: implementation '...

      Android代码-Android Architecture_googlesamples

      Android Architecture Blueprints The Android framework provides a lot of flexibility in deciding how to organize and architect an Android app. While this freedom is very valuable, it can also lead to ...

      Computer Architecture, Sixth Edition,A Quantitative Approach

      For over 20 years, Computer Architecture: A Quantitative Approach has been considered essential reading by instructors, students, and practitioners of computer design. The latest edition of this ...

      computer architecture a quantitative approach 6th.rar

      It also includes a new chapter on domain-specific architectures and an updated chapter on warehouse-scale computing that features the first public information on Google's newest WSC.

      北航云计算公开课05a Google storage_architecture_and_challenges

      ### 北航云计算公开课05a:Google存储架构与挑战 #### 一、引言 在本次公开课中,Google的首席工程师安德鲁·菲克斯(Andrew Fikes)分享了Google如何构建和维护其全球规模的存储系统。他强调了几个关键点,包括...

      Android4TV - SW Architecture v1.5

      《Android4TV - SW Architecture v1.5》是关于Android TV软件架构的详细解析文档,主要针对Java开发者。本文将深入探讨Android TV平台的软件架构,包括其核心组件、服务、应用程序开发要点以及与传统Android系统的...

      solution for computer architecture 5th edition (appendix)

      最近在学习 computer architecture a quantitative approach 5th edition,发现很多下载的答案没有appendix,用Google找了很久找到了appendix的答案,希望给大家带来帮助

      flutter-architecture-blueprints-源码.rar

      Flutter作为Google推出的一款强大的移动应用开发框架,以其高性能、跨平台的特性受到了广大开发者们的喜爱。本压缩包“flutter-architecture-blueprints-源码”为我们揭示了Flutter应用架构设计的蓝图,通过源码分析...

      Android-android-mvvm-architecture.zip

      Android-android-mvvm-architecture.zip,此存储库包含一个详细的示例应用程序,该应用程序使用dagger2、room、rxjava2、fastdroidnetworking和placeholderview实现mvvm体系结构,安卓系统是谷歌在2008年设计和制造的...

      Google TPU V3 Codesigning Architecture and Infrastructure

      Google TPU V3的出现,正是为了在机器学习和深度学习领域提供强大的计算能力。TPU V3通过一种被称为codesigning的架构设计,将硬件与软件编程模型紧密结合,使得其在处理机器学习工作负载时具有无与伦比的优势。 ...

      Computer Architecture:A Quantitative Approach(5th)

      Amazon Web Services的James Hamilton提到,只有Hennessy和Patterson才能接触到谷歌、亚马逊、微软等云服务和互联网规模应用提供商的内部人士,因此书中对这一领域的覆盖在业界中无出其右。 本书还介绍了大规模...

      Computer organization and architecture

      它将部分软件运行在PMD上,其余部分则运行在云端,亚马逊和谷歌是这方面的佼佼者。 在学习计算机组织与架构的过程中,我们将会了解程序是如何被翻译成机器语言,以及硬件是如何执行这些机器语言的。我们还将研究...

    Global site tag (gtag.js) - Google Analytics