`
standalone
  • 浏览: 611403 次
  • 性别: Icon_minigender_1
  • 来自: 上海
社区版块
存档分类
最新评论

Hadoop Research Topics

阅读更多

 

Recently, I visited a few premier educational institutes in India, e.g. Indian Institute of Technology (IIT) at Delhi and Guwahati. Most of the undergraduate students at these two institutes are somewhat familiar with Hadoop and would like to work on Hadoop related projects as part of their course work. One commonly asked question that I got from these students is what Hadoop feature can I work on?

 

Here are some items that I have in mind that are good topics for students to attempt if they want to work in Hadoop.

  • Ability to make Hadoop scheduler resource aware, especially CPU, memory and IO resources. The current implementation is based on statically configured slots.
  • Abilty to make a map-reduce job take new input splits even after a map-reduce job has already started.
  • Ability to dynamically increase replicas of data in HDFS based on access patterns. This is needed to handle hot-spots of data.
  • Ability to extend the map-reduce framework to be able to process data that resides partly in memory. One assumption of the current implementation is that the map-reduce framework is used to scan data that resides on disk devices. But memory on commodity machines is becoming larger and larger. A cluster of 3000 machines with 64 GB each can keep about 200TB of data in memory! It would be nice if the hadoop framework can support caching the hot set of data on the RAM of the tasktracker machines. Performance should increase dramatically because it is costly to serialize/compress data from the disk into memory for every query.
  • Heuristics to efficiently 'speculate' map-reduce tasks to help work around machines that are laggards. In the cloud, the biggest challenge for fault tolerance is not to handle failures but rather anomalies that makes parts of the cloud slow (but not fail completely), these impact performance of jobs.
  • Make map-reduce jobs work across data centers. In many cases, a single hadoop cluster cannot fit into a single data center and a user has to partition the dataset into two hadoop clusters in two different data centers.
  • High Availability of the JobTracker. In the current implementation, if the JobTracker machine dies, then all currently running jobs fail.
  • Ability to create snapshots in HDFS. The primary use of these snapshots is to retrieve a dataset that was erroneously modified/deleted by a buggy application.

The first thing for a student who wants to do any of these projects is to download the code from HDFS andMAPREDUCE. Then create an account in the bug tracking software here. Please search for an existing JIRA that describes your project; if none exists then please create a new JIRA. Then please write a design document proposal so that the greater Apache Hadoop community can deliberate on the proposal and post this document to the relevant JIRA.

分享到:
评论

相关推荐

    hadoop winutils hadoop.dll

    Hadoop是Apache软件基金会开发的一个开源分布式计算框架,它允许在普通硬件上高效处理大量数据。在Windows环境下,Hadoop的使用与Linux有所不同,因为它的设计最初是针对Linux操作系统的。"winutils"和"hadoop.dll...

    hadoop2.7.3 Winutils.exe hadoop.dll

    在IT行业中,Hadoop是一个广泛使用的开源框架,主要用于大数据处理和分布式存储。Hadoop 2.7.3是这个框架的一个稳定版本,它包含了多个改进和优化,以提高性能和稳定性。在这个版本中,Winutils.exe和hadoop.dll是两...

    hadoop的dll文件 hadoop.zip

    Hadoop是一个开源的分布式计算框架,由Apache基金会开发,它主要设计用于处理和存储大量数据。在提供的信息中,我们关注的是"Hadoop的dll文件",这是一个动态链接库(DLL)文件,通常在Windows操作系统中使用,用于...

    hadoop的hadoop.dll和winutils.exe下载

    在Hadoop生态系统中,`hadoop.dll`和`winutils.exe`是两个关键组件,尤其对于Windows用户来说,它们在本地开发和运行Hadoop相关应用时必不可少。`hadoop.dll`是一个动态链接库文件,主要用于在Windows环境中提供...

    hadoop.dll & winutils.exe For hadoop-2.7.1

    在大数据处理领域,Hadoop是一个不可或缺的开源框架,它提供了分布式存储和计算的能力。本文将详细探讨与"Hadoop.dll"和"winutils.exe"相关的知识点,以及它们在Hadoop-2.7.1版本中的作用。 Hadoop.dll是Hadoop在...

    hadoop2.7.3的hadoop.dll和winutils.exe

    在IT行业中,Hadoop是一个广泛使用的开源框架,主要用于大数据处理和分布式存储。Hadoop 2.7.3是Hadoop发展中的一个重要版本,它包含了众多的优化和改进,旨在提高性能、稳定性和易用性。在这个版本中,`hadoop.dll`...

    Hadoop下载 hadoop-2.9.2.tar.gz

    Hadoop 是一个处理、存储和分析海量的分布式、非结构化数据的开源框架。最初由 Yahoo 的工程师 Doug Cutting 和 Mike Cafarella Hadoop 是一个处理、存储和分析海量的分布式、非结构化数据的开源框架。最初由 Yahoo...

    hadoop2.7.7对应的hadoop.dll,winutils.exe

    在Hadoop生态系统中,Hadoop 2.7.7是一个重要的版本,它为大数据处理提供了稳定性和性能优化。Hadoop通常被用作Linux环境下的分布式计算框架,但有时开发者或学习者在Windows环境下也需要进行Hadoop相关的开发和测试...

    Hadoop下载 hadoop-3.3.3.tar.gz

    Hadoop是一个由Apache基金会所开发的分布式系统基础架构。用户可以在不了解分布式底层细节的情况下,开发分布式程序。充分利用集群的威力进 Hadoop是一个由Apache基金会所开发的分布式系统基础架构。用户可以在不...

    hadoop.dll & winutils.exe For hadoop-2.6.0

    在Hadoop生态系统中,`hadoop.dll`和`winutils.exe`是两个关键组件,尤其对于Windows用户来说。本文将详细介绍这两个文件以及它们在Hadoop 2.6.0版本中的作用。 `hadoop.dll`是Hadoop在Windows环境下运行所必需的一...

    hadoop2.6 hadoop.dll+winutils.exe

    标题 "hadoop2.6 hadoop.dll+winutils.exe" 提到的是Hadoop 2.6版本中的两个关键组件:`hadoop.dll` 和 `winutils.exe`,这两个组件对于在Windows环境中配置和运行Hadoop至关重要。Hadoop原本是为Linux环境设计的,...

    win环境 hadoop 3.1.0安装包

    在Windows环境下安装Hadoop 3.1.0是学习和使用大数据处理技术的重要步骤。Hadoop是一个开源框架,主要用于分布式存储和处理大规模数据集。在这个过程中,我们将详细讲解Hadoop 3.1.0在Windows上的安装过程以及相关...

    各个版本Hadoop,hadoop.dll以及winutils.exe文件下载大合集

    在IT行业中,Hadoop是一个广泛使用的开源框架,主要用于大数据处理和分布式存储。它是由Apache软件基金会开发并维护的,旨在实现高效、可扩展的数据处理能力。Hadoop的核心由两个主要组件构成:Hadoop Distributed ...

    Hadoop3.1.3.rar

    Hadoop是Apache软件基金会开发的一个开源分布式计算框架,它的核心设计是处理和存储大量数据的能力。这个名为"Hadoop3.1.3.rar"的压缩包文件包含了Hadoop 3.1.3版本的所有组件和相关文件,使得用户可以下载并进行...

    hadoop2.7.4 hadoop.dll包括winutils.exe

    Hadoop是Apache软件基金会开发的一个开源分布式计算框架,主要由HDFS(Hadoop Distributed File System)和MapReduce两大部分组成,旨在提供一种可靠、可扩展、高效的数据处理和存储解决方案。在标题中提到的...

    winutils+hadoop.dll+eclipse插件(hadoop2.7)

    在Hadoop生态系统中,`winutils.exe`和`hadoop.dll`是Windows环境下运行Hadoop必备的组件,尤其对于开发和测试环境来说至关重要。这里我们深入探讨这两个组件以及与Eclipse插件的相关性。 首先,`winutils.exe`是...

    hadoop2.6.0插件+64位winutils+hadoop.dll

    在IT行业中,Hadoop是一个广泛使用的开源框架,主要用于大数据处理和分布式存储。Hadoop2.6.0是这个框架的一个重要版本,它包含了多项优化和改进,以提高系统的稳定性和性能。在这个压缩包中,我们关注的是与Windows...

    hadoop2.7.3 hadoop.dll

    在windows环境下开发hadoop时,需要配置HADOOP_HOME环境变量,变量值D:\hadoop-common-2.7.3-bin-master,并在Path追加%HADOOP_HOME%\bin,有可能出现如下错误: org.apache.hadoop.io.nativeio.NativeIO$Windows....

    hadoop环境缺少的hadoop.dll ,winutils.exe包

    在搭建Hadoop环境的过程中,经常会遇到一些特定的依赖问题,比如缺少`hadoop.dll`和`winutils.exe`这两个关键组件。本文将详细介绍这两个文件及其在Hadoop生态系统中的作用,以及如何解决它们缺失的问题。 首先,`...

Global site tag (gtag.js) - Google Analytics