%SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem;%SYSTEMROOT%\System32\WindowsPowerShell\v1.0\;D:\Program Files\Microsoft SQL Server\90\Tools\binn\;D:\Java\jdk1.6.0\bin;K:\cygwinnew\bin;D:\Program Files\Adobe\Flex Builder 3\sdks\3.2.0\bin;D:\MinGW\bin;D:\Program Files\Microsoft SQL Server\100\Tools\Binn\;D:\Program Files\Microsoft SQL Server\100\DTS\Binn\;D:\Program Files\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\;D:\Program Files\Microsoft Visual Studio 9.0\Common7\IDE\PrivateAssemblies\;D:\Program Files\TortoiseSVN\bin;E:\xpdf\chinese-simplified;E:\xpdf\chinese-simplified\CMap
authorized_keys
/cygdrive/D/Java/jdk1.6.0
/cygdrive/D/tmp/testdata/input
/cygdrive/D/tmp/testoutput
D:\tmp\testoutput
hadoop namenode -format
D:\tmp\testdata\input
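The paths above name the same directories in two forms, for example D:\tmp\testoutput and /cygdrive/D/tmp/testoutput. On Cygwin the cygpath utility does this conversion; the sketch below only illustrates the mapping in plain shell. The function name win_to_cygdrive is made up for this example, and it assumes a simple drive-letter path.

```shell
# Hypothetical helper: convert a Windows path such as D:\tmp\testoutput
# into the /cygdrive form used elsewhere in these notes.
win_to_cygdrive() {
    path=$1
    drive=${path%%:*}     # text before the first colon, e.g. "D"
    rest=${path#*:}       # text after the first colon, e.g. "\tmp\testoutput"
    rest=$(printf '%s' "$rest" | tr '\\' '/')   # backslashes become forward slashes
    printf '/cygdrive/%s%s\n' "$drive" "$rest"
}

win_to_cygdrive 'D:\tmp\testoutput'
```

The drive letter is kept in its original case, matching the /cygdrive/D/... spelling used in these notes.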
Upload the files under input to the input directory in DFS
$ ./bin/hadoop fs -put D:/tmp/testdata/input input
jar hadoop-0.20.1-examples.jar wordcount input/input output-dir (older releases name the jar differently, e.g. hadoop-0.16.4-examples.jar)
$ ./bin/hadoop jar hadoop-0.20.1-examples.jar wordcount input/input output-dir
11/12/28 17:39:40 INFO input.FileInputFormat: Total input paths to process : 3
11/12/28 17:39:41 INFO mapred.JobClient: Running job: job_201112281720_0003
11/12/28 17:39:42 INFO mapred.JobClient: map 0% reduce 0%
11/12/28 17:39:51 INFO mapred.JobClient: map 66% reduce 0%
11/12/28 17:39:54 INFO mapred.JobClient: map 100% reduce 0%
11/12/28 17:40:03 INFO mapred.JobClient: map 100% reduce 100%
11/12/28 17:40:05 INFO mapred.JobClient: Job complete: job_201112281720_0003
11/12/28 17:40:05 INFO mapred.JobClient: Counters: 17
11/12/28 17:40:05 INFO mapred.JobClient: Job Counters
11/12/28 17:40:05 INFO mapred.JobClient: Launched reduce tasks=1
11/12/28 17:40:05 INFO mapred.JobClient: Launched map tasks=3
11/12/28 17:40:05 INFO mapred.JobClient: Data-local map tasks=3
11/12/28 17:40:05 INFO mapred.JobClient: FileSystemCounters
11/12/28 17:40:05 INFO mapred.JobClient: FILE_BYTES_READ=290
11/12/28 17:40:05 INFO mapred.JobClient: HDFS_BYTES_READ=161
11/12/28 17:40:05 INFO mapred.JobClient: FILE_BYTES_WRITTEN=607
11/12/28 17:40:05 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=139
11/12/28 17:40:05 INFO mapred.JobClient: Map-Reduce Framework
11/12/28 17:40:05 INFO mapred.JobClient: Reduce input groups=0
11/12/28 17:40:05 INFO mapred.JobClient: Combine output records=16
11/12/28 17:40:05 INFO mapred.JobClient: Map input records=3
11/12/28 17:40:05 INFO mapred.JobClient: Reduce shuffle bytes=221
11/12/28 17:40:05 INFO mapred.JobClient: Reduce output records=0
11/12/28 17:40:05 INFO mapred.JobClient: Spilled Records=32
11/12/28 17:40:05 INFO mapred.JobClient: Map output bytes=284
11/12/28 17:40:05 INFO mapred.JobClient: Combine input records=30
11/12/28 17:40:05 INFO mapred.JobClient: Map output records=30
11/12/28 17:40:05 INFO mapred.JobClient: Reduce input records=16
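As a quick cross-check of what the wordcount job computes, the same counts can be produced locally with standard tools. This is only a sketch: the function name local_wordcount is hypothetical, and it assumes words are separated by whitespace, as in the stock Hadoop WordCount example.

```shell
# Hypothetical local equivalent of wordcount: one "word<TAB>count" line per
# distinct word, sorted by word.
local_wordcount() {
    tr -s '[:space:]' '\n' |      # put every word on its own line
        sed '/^$/d' |             # drop empty lines
        LC_ALL=C sort |           # group identical words together
        uniq -c |                 # count each group
        awk '{print $2 "\t" $1}'  # reorder to word, then count
}

printf 'a b a\nc b a\n' | local_wordcount
```

Piping the local input files through this function gives something to compare against the part files in output-dir.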
bin/hadoop jar hadoop-0.20.1-index.jar org.apache.hadoop.contrib.index.main.UpdateIndex -inputPaths input/input -outputPath index_msg_out_010 -indexPath index_030 -numShards 2 -numMapTasks 2 -conf conf/index-config.xml
bin/hadoop jar hadoop-0.20.1-index.jar org.apache.hadoop.contrib.index.main.UpdateIndex -inputPaths D:/tmp/testdata/input -outputPath index_msg_out_010 -indexPath index_030 -numShards 1 -numMapTasks 1 -conf conf/index-config.xml
bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths D:/tmp/testdata/input -outputPath index_msg_out_010 -indexPath index_030 -numShards 1 -numMapTasks 1 -conf conf/index-config.xml
Here input/input and index_msg_out_010 are directories in the DFS file system:
$ bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths input/input -outputPath index_msg_out_010 -indexPath index_030 -numShards 1 -numMapTasks 1 -conf conf/index-config.xml
11/12/29 10:05:20 INFO main.UpdateIndex: inputPaths = input/input
11/12/29 10:05:20 INFO main.UpdateIndex: outputPath = index_msg_out_010
11/12/29 10:05:20 INFO main.UpdateIndex: shards = null
11/12/29 10:05:20 INFO main.UpdateIndex: indexPath = index_030
11/12/29 10:05:20 INFO main.UpdateIndex: numShards = 1
11/12/29 10:05:20 INFO main.UpdateIndex: numMapTasks= 1
11/12/29 10:05:20 INFO main.UpdateIndex: confPath = conf/index-config.xml
11/12/29 10:05:21 INFO main.UpdateIndex: sea.index.updater = org.apache.hadoop.contrib.index.mapred.IndexUpdater
11/12/29 10:05:21 INFO mapred.IndexUpdater: mapred.input.dir = hdfs://localhost:18888/user/kelo-dichan/administrator/input/input
11/12/29 10:05:21 INFO mapred.IndexUpdater: mapred.output.dir = hdfs://localhost:18888/user/kelo-dichan/administrator/index_msg_out_010
11/12/29 10:05:21 INFO mapred.IndexUpdater: mapred.map.tasks = 1
11/12/29 10:05:21 INFO mapred.IndexUpdater: mapred.reduce.tasks = 1
11/12/29 10:05:21 INFO mapred.IndexUpdater: 1 shards = -1@index_030/00000@-1
11/12/29 10:05:21 INFO mapred.IndexUpdater: mapred.input.format.class = org.apache.hadoop.contrib.index.example.LineDocInputFormat
11/12/29 10:05:21 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/29 10:05:21 INFO mapred.FileInputFormat: Total input paths to process : 3
11/12/29 10:05:23 INFO mapred.JobClient: Running job: job_201112281720_0005
11/12/29 10:05:24 INFO mapred.JobClient: map 0% reduce 0%
Run succeeded
$ bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths input/input -outputPath index_msg_out_012 -indexPath index_032 -numShards 1 -numMapTasks 1 -conf conf/index-config.xml
11/12/29 10:17:12 INFO main.UpdateIndex: inputPaths = input/input
11/12/29 10:17:12 INFO main.UpdateIndex: outputPath = index_msg_out_012
11/12/29 10:17:12 INFO main.UpdateIndex: shards = null
11/12/29 10:17:12 INFO main.UpdateIndex: indexPath = index_032
11/12/29 10:17:12 INFO main.UpdateIndex: numShards = 1
11/12/29 10:17:12 INFO main.UpdateIndex: numMapTasks= 1
11/12/29 10:17:12 INFO main.UpdateIndex: confPath = conf/index-config.xml
11/12/29 10:17:13 INFO main.UpdateIndex: sea.index.updater = org.apache.hadoop.contrib.index.mapred.IndexUpdater
11/12/29 10:17:13 INFO mapred.IndexUpdater: mapred.input.dir = hdfs://localhost:18888/user/kelo-dichan/administrator/input/input
11/12/29 10:17:13 INFO mapred.IndexUpdater: mapred.output.dir = hdfs://localhost:18888/user/kelo-dichan/administrator/index_msg_out_012
11/12/29 10:17:13 INFO mapred.IndexUpdater: mapred.map.tasks = 1
11/12/29 10:17:13 INFO mapred.IndexUpdater: mapred.reduce.tasks = 1
11/12/29 10:17:13 INFO mapred.IndexUpdater: 1 shards = -1@index_032/00000@-1
11/12/29 10:17:13 INFO mapred.IndexUpdater: mapred.input.format.class = org.apache.hadoop.contrib.index.example.LineDocInputFormat
11/12/29 10:17:13 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/29 10:17:13 INFO mapred.FileInputFormat: Total input paths to process :
11/12/29 10:17:13 INFO mapred.JobClient: Running job: job_201112291014_0002
11/12/29 10:17:14 INFO mapred.JobClient: map 0% reduce 0%
11/12/29 10:17:22 INFO mapred.JobClient: map 66% reduce 0%
11/12/29 10:17:25 INFO mapred.JobClient: map 100% reduce 0%
11/12/29 10:17:32 INFO mapred.JobClient: map 100% reduce 22%
11/12/29 10:17:38 INFO mapred.JobClient: map 100% reduce 100%
11/12/29 10:17:40 INFO mapred.JobClient: Job complete: job_201112291014_0002
11/12/29 10:17:40 INFO mapred.JobClient: Counters: 18
11/12/29 10:17:40 INFO mapred.JobClient: Job Counters
11/12/29 10:17:40 INFO mapred.JobClient: Launched reduce tasks=1
11/12/29 10:17:40 INFO mapred.JobClient: Launched map tasks=3
11/12/29 10:17:40 INFO mapred.JobClient: Data-local map tasks=3
11/12/29 10:17:40 INFO mapred.JobClient: FileSystemCounters
11/12/29 10:17:40 INFO mapred.JobClient: FILE_BYTES_READ=2104
11/12/29 10:17:40 INFO mapred.JobClient: HDFS_BYTES_READ=161
11/12/29 10:17:40 INFO mapred.JobClient: FILE_BYTES_WRITTEN=2910
11/12/29 10:17:40 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=622
11/12/29 10:17:40 INFO mapred.JobClient: Map-Reduce Framework
11/12/29 10:17:40 INFO mapred.JobClient: Reduce input groups=1
11/12/29 10:17:40 INFO mapred.JobClient: Combine output records=3
11/12/29 10:17:40 INFO mapred.JobClient: Map input records=3
11/12/29 10:17:40 INFO mapred.JobClient: Reduce shuffle bytes=892
11/12/29 10:17:40 INFO mapred.JobClient: Reduce output records=1
11/12/29 10:17:40 INFO mapred.JobClient: Spilled Records=6
11/12/29 10:17:40 INFO mapred.JobClient: Map output bytes=1302
11/12/29 10:17:40 INFO mapred.JobClient: Map input bytes=161
11/12/29 10:17:40 INFO mapred.JobClient: Combine input records=3
11/12/29 10:17:40 INFO mapred.JobClient: Map output records=3
11/12/29 10:17:40 INFO mapred.JobClient: Reduce input records=3
11/12/29 10:17:40 INFO main.UpdateIndex: Index update job is done
11/12/29 10:17:40 INFO main.UpdateIndex: Elapsed time is 27s
Elapsed time is 27s
Administrator@kelo-dichan /cygdrive/d/hadoop/run
$ bin/hadoop fs -copyToLocal /user/kelo-dichan/administrator/*.* D:\tmp\testoutput
Administrator@kelo-dichan /cygdrive/d/hadoop/run
$ bin/hadoop fs -copyToLocal /user/kelo-dichan/administrator/*.* /cygdrive/D/tmp/testoutput
Administrator@kelo-dichan /cygdrive/d/hadoop/run
$
// Fetch all files under a directory in the HDFS file system
$ bin/hadoop fs -get /user/kelo-dichan/administrator/index_032/00000/ D:/tmp/testoutput
$ bin/hadoop fs -get /user/kelo-dichan/administrator/index_035/00000/ D:/tmp/testoutput
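Brief summary of distributed indexing
1. The command is as follows
$ bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths input/input -outputPath index_msg_out_012 -indexPath index_032 -numShards 1 -numMapTasks 1 -conf conf/index-config.xml
Parameter notes
The paths are paths in the HDFS distributed file system
-numShards
-numMapTasks
Changing these two values changes the result: for example, when both are 3 (the input path contains 3 files), the resulting index is missing data (it contains fewer documents). The cause is not yet known.
What would the result look like with compound files instead?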
$ bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths input/input -outputPath index_msg_out_020 -indexPath index_040 -numShards 3 -numMapTasks 3 -conf conf/index-config.xml
11/12/29 11:39:19 INFO main.UpdateIndex: inputPaths = input/input
11/12/29 11:39:19 INFO main.UpdateIndex: outputPath = index_msg_out_020
11/12/29 11:39:19 INFO main.UpdateIndex: shards = null
11/12/29 11:39:19 INFO main.UpdateIndex: indexPath = index_040
11/12/29 11:39:19 INFO main.UpdateIndex: numShards = 3
11/12/29 11:39:19 INFO main.UpdateIndex: numMapTasks= 3
11/12/29 11:39:19 INFO main.UpdateIndex: confPath = conf/index-config.xml
11/12/29 11:39:20 INFO main.UpdateIndex: sea.index.updater = org.apache.hadoop.contrib.index.mapred.IndexUpdater
11/12/29 11:39:20 INFO mapred.IndexUpdater: mapred.input.dir = hdfs://localhost:18888/user/kelo-dichan/administrator/input/input
11/12/29 11:39:20 INFO mapred.IndexUpdater: mapred.output.dir = hdfs://localhost:18888/user/kelo-dichan/administrator/index_msg_out_020
11/12/29 11:39:20 INFO mapred.IndexUpdater: mapred.map.tasks = 3
11/12/29 11:39:20 INFO mapred.IndexUpdater: mapred.reduce.tasks = 3
11/12/29 11:39:20 INFO mapred.IndexUpdater: 3 shards = -1@index_040/00000@-1,-1@index_040/00001@-1,-1@index_040/00002@-1
11/12/29 11:39:20 INFO mapred.IndexUpdater: mapred.input.format.class = org.apache.hadoop.contrib.index.example.LineDocInputFormat
11/12/29 11:39:20 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/29 11:39:20 INFO mapred.FileInputFormat: Total input paths to process
11/12/29 11:39:20 INFO mapred.JobClient: Running job: job_201112291106_0009
11/12/29 11:39:21 INFO mapred.JobClient: map 0% reduce 0%
11/12/29 11:39:30 INFO mapred.JobClient: map 33% reduce 0%
11/12/29 11:39:34 INFO mapred.JobClient: map 100% reduce 0%
11/12/29 11:39:40 INFO mapred.JobClient: map 100% reduce 7%
11/12/29 11:39:43 INFO mapred.JobClient: map 100% reduce 14%
11/12/29 11:39:46 INFO mapred.JobClient: map 100% reduce 40%
11/12/29 11:39:49 INFO mapred.JobClient: map 100% reduce 66%
11/12/29 11:39:52 INFO mapred.JobClient: map 100% reduce 100%
11/12/29 11:39:54 INFO mapred.JobClient: Job complete: job_201112291106_0009
11/12/29 11:39:54 INFO mapred.JobClient: Counters: 18
11/12/29 11:39:54 INFO mapred.JobClient: Job Counters
11/12/29 11:39:54 INFO mapred.JobClient: Launched reduce tasks=3
11/12/29 11:39:54 INFO mapred.JobClient: Launched map tasks=3
11/12/29 11:39:54 INFO mapred.JobClient: Data-local map tasks=3
11/12/29 11:39:54 INFO mapred.JobClient: FileSystemCounters
11/12/29 11:39:54 INFO mapred.JobClient: FILE_BYTES_READ=2648
11/12/29 11:39:54 INFO mapred.JobClient: HDFS_BYTES_READ=161
11/12/29 11:39:54 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3279
11/12/29 11:39:54 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1025
11/12/29 11:39:54 INFO mapred.JobClient: Map-Reduce Framework
11/12/29 11:39:54 INFO mapred.JobClient: Reduce input groups=2
11/12/29 11:39:54 INFO mapred.JobClient: Combine output records=3
11/12/29 11:39:54 INFO mapred.JobClient: Map input records=3
11/12/29 11:39:54 INFO mapred.JobClient: Reduce shuffle bytes=948
11/12/29 11:39:54 INFO mapred.JobClient: Reduce output records=2
11/12/29 11:39:54 INFO mapred.JobClient: Spilled Records=6
11/12/29 11:39:54 INFO mapred.JobClient: Map output bytes=1350
11/12/29 11:39:54 INFO mapred.JobClient: Map input bytes=161
11/12/29 11:39:54 INFO mapred.JobClient: Combine input records=3
11/12/29 11:39:54 INFO mapred.JobClient: Map output records=3
11/12/29 11:39:54 INFO mapred.JobClient: Reduce input records=3
11/12/29 11:39:54 INFO main.UpdateIndex: Index update job is done
11/12/29 11:39:54 INFO main.UpdateIndex: Elapsed time is 33s
Elapsed time is 33s
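The "3 shards" log line in the run above names one directory per shard under the index path (index_040/00000 through index_040/00002). One thing worth checking when documents appear to be missing is whether every shard directory was fetched, since the earlier -get commands copy only 00000. The sketch below just enumerates the shard directories. shard_dirs is a hypothetical helper, and the five-digit naming is inferred from the log rather than from documentation; whether this accounts for the missing documents is not confirmed.

```shell
# Hypothetical helper: list the shard directories for an index path,
# following the zero-padded naming seen in the log (00000, 00001, ...).
shard_dirs() {
    index_path=$1
    num_shards=$2
    i=0
    while [ "$i" -lt "$num_shards" ]; do
        printf '%s/%05d\n' "$index_path" "$i"
        i=$((i + 1))
    done
}

shard_dirs index_040 3
```

Each printed path could then be passed to bin/hadoop fs -get in turn, so that all shards end up in the local output directory.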
When the file system is the Hadoop distributed file system (HDFS):
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8888</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem. file:/// hdfs://localhost:8888</description>
</property>
When the file system is local (this approach also works well), in core-site.xml:
<property>
<name>fs.default.name</name>
<value>file:///</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem. file:/// hdfs://localhost:8888</description>
</property>
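The description text in both property blocks says that the URI scheme selects the fs.SCHEME.impl property and that the authority carries host and port. A minimal sketch of that split, using only shell parameter expansion, is shown below. The function names fs_impl_property and fs_authority are made up for this example.

```shell
# Hypothetical helper: name of the config property that selects the
# FileSystem implementation for a given fs.default.name value.
fs_impl_property() {
    scheme=${1%%:*}                 # text before the first colon
    printf 'fs.%s.impl\n' "$scheme"
}

# Hypothetical helper: the authority (host:port) part of the URI, if any.
fs_authority() {
    rest=${1#*://}                  # drop "scheme://"
    printf '%s\n' "${rest%%/*}"     # keep text up to the first slash
}

fs_impl_property 'hdfs://localhost:8888'
fs_authority 'hdfs://localhost:8888'
fs_impl_property 'file:///'
```

For file:/// the authority is empty, which is why no host or port is involved when the local file system is used.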
Below is an example using the local file system. testdata/input is a local relative directory (the full path is D:/hadoop/run/testdata/input; D:/hadoop/run is the installation path)
jar hadoop-0.20.1-examples.jar wordcount testdata/input output-dir1
bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths testdata/input -outputPath index_msg_out_012 -indexPath index_032 -numShards 1 -numMapTasks 1 -conf conf/index-config.xml
hadoop-0.20.2
./bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input/input output-di
/cygdrive/d/tmp/testdata/input
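Using MapReduce in Eclipse
Required environment
eclipse 3.3
hadoop-0.20.2-eclipse-plugin.jar from hadoop 0.20.2
Start with the debug script (after running the script below, debug the same program in Eclipse using remote debugging, which is configurable)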
./bin/hddebug jar hadoop-0.20.2-examples.jar wordcount input/input output-di
Listening for transport dt_socket at address: 28888
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://127.0.0.1:8888/user/kelo-dichan/administrator/input/input
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
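The exception message shows how the relative argument input/input was expanded into a full HDFS path. The sketch below reproduces that expansion, assuming relative paths are resolved under /user/USERNAME on the default file system, which is what the message suggests. resolve_hdfs_path is a hypothetical helper, not a Hadoop command.

```shell
# Hypothetical helper: expand a path the way the error message suggests.
# Absolute paths are used as given; relative ones go under /user/<user>.
resolve_hdfs_path() {
    default_fs=$1
    user=$2
    path=$3
    case $path in
        /*) printf '%s%s\n' "$default_fs" "$path" ;;
        *)  printf '%s/user/%s/%s\n' "$default_fs" "$user" "$path" ;;
    esac
}

resolve_hdfs_path 'hdfs://127.0.0.1:8888' 'kelo-dichan/administrator' 'input/input'
```

Running bin/hadoop fs -ls input/input against the same configuration shows whether that directory exists. The earlier runs logged hdfs://localhost:18888 while this error mentions port 8888, so the files may have been uploaded to a different HDFS instance; that is only a guess from the logs.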
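Detailed steps
1. Reference
1. First set up Hadoop on Windows 7 so that it runs normally
2. Then make a copy of the bin/hadoop script and name the copy hddebug
3. Add one line to hddebug,
as follows, at the line if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then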
HADOOP_OPTS="$HADOOP_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,address=28888,server=y,suspend=y"
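4. Run
./bin/hddebug jar hadoop-0.20.2-examples.jar wordcount input/input output-di
You should see: Listening for transport dt_socket at address: 28888
5. Start Eclipse and debug the wordcount code
In the menu, open Debug and set up a remote debugging configuration; debugging can then proceed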
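Step 3 can be sketched as a small helper that builds the JDWP option string and appends it to HADOOP_OPTS. The helper name jdwp_opts is hypothetical; the option string itself and port 28888 are the ones used in the steps above.

```shell
# Hypothetical helper: build the JDWP options used in hddebug for a given port.
jdwp_opts() {
    printf '%s' "-Xdebug -Xrunjdwp:transport=dt_socket,address=$1,server=y,suspend=y"
}

# Append the debug options to whatever HADOOP_OPTS already holds.
HADOOP_OPTS="${HADOOP_OPTS:-} $(jdwp_opts 28888)"
printf '%s\n' "$HADOOP_OPTS"
```

With suspend=y the JVM waits until a debugger attaches, which is why the script prints the "Listening for transport dt_socket" line and then pauses. In the Eclipse remote debugging configuration, the port must match the address value.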
authorized_keys
/cygdrive/D/Java/jdk1.6.0
/cygdrive/D/tmp/testdata/input
/cygdrive/D/tmp/testoutput
D:\tmp\testoutput
hadoop namenode -formate
D:\tmp\testdata\input
上传 input下的文件到 dfs中的input文件中
$ ./bin/hadoop fs -put D:/tmp/testdata/input input
jar hadoop-0.20.1-examples.jar wordcount input/input output-dir ,其中hadoop-0.16.4-examples.jar
$ ./bin/hadoop jar hadoop-0.20.1-examples.jar wordcount input/input output-di
r
11/12/28 17:39:40 INFO input.FileInputFormat: Total input paths to process : 3
11/12/28 17:39:41 INFO mapred.JobClient: Running job: job_201112281720_0003
11/12/28 17:39:42 INFO mapred.JobClient: map 0% reduce 0%
11/12/28 17:39:51 INFO mapred.JobClient: map 66% reduce 0%
11/12/28 17:39:54 INFO mapred.JobClient: map 100% reduce 0%
11/12/28 17:40:03 INFO mapred.JobClient: map 100% reduce 100%
11/12/28 17:40:05 INFO mapred.JobClient: Job complete: job_201112281720_0003
11/12/28 17:40:05 INFO mapred.JobClient: Counters: 17
11/12/28 17:40:05 INFO mapred.JobClient: Job Counters
11/12/28 17:40:05 INFO mapred.JobClient: Launched reduce tasks=1
11/12/28 17:40:05 INFO mapred.JobClient: Launched map tasks=3
11/12/28 17:40:05 INFO mapred.JobClient: Data-local map tasks=3
11/12/28 17:40:05 INFO mapred.JobClient: FileSystemCounters
11/12/28 17:40:05 INFO mapred.JobClient: FILE_BYTES_READ=290
11/12/28 17:40:05 INFO mapred.JobClient: HDFS_BYTES_READ=161
11/12/28 17:40:05 INFO mapred.JobClient: FILE_BYTES_WRITTEN=607
11/12/28 17:40:05 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=139
11/12/28 17:40:05 INFO mapred.JobClient: Map-Reduce Framework
11/12/28 17:40:05 INFO mapred.JobClient: Reduce input groups=0
11/12/28 17:40:05 INFO mapred.JobClient: Combine output records=16
11/12/28 17:40:05 INFO mapred.JobClient: Map input records=3
11/12/28 17:40:05 INFO mapred.JobClient: Reduce shuffle bytes=221
11/12/28 17:40:05 INFO mapred.JobClient: Reduce output records=0
11/12/28 17:40:05 INFO mapred.JobClient: Spilled Records=32
11/12/28 17:40:05 INFO mapred.JobClient: Map output bytes=284
11/12/28 17:40:05 INFO mapred.JobClient: Combine input records=30
11/12/28 17:40:05 INFO mapred.JobClient: Map output records=30
11/12/28 17:40:05 INFO mapred.JobClient: Reduce input records=16
bin/hadoop jar hadoop-0.20.1-index.jar org.apache.hadoop.contrib.index.main.UpdateIndex -inputPaths input/input -outputPath index_msg_out_010 -indexPath index_030 -numShards 2 -numMapTasks 2 -conf conf/index-config.xml
bin/hadoop jar hadoop-0.20.1-index.jar org.apache.hadoop.contrib.index.main.UpdateIndex -inputPaths D:/tmp/testdata/input -outputPath index_msg_out_010 -indexPath index_030 -numShards 1 -numMapTasks 1 -conf conf/index-config.xml
bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths D:/tmp/testdata/input -outputPath index_msg_out_010 -indexPath index_030 -numShards 1 -numMapTasks 1 -conf conf/index-config.xml
$ bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths input/input(dfs文件系统中的目录) -outputPath index_msg_out_010(dfs文件系统中的目录) -indexPath index_030 -numShards 1 -numMapTasks 1 -conf conf/index-config.xml
11/12/29 10:05:20 INFO main.UpdateIndex: inputPaths = input/input
11/12/29 10:05:20 INFO main.UpdateIndex: outputPath = index_msg_out_010
11/12/29 10:05:20 INFO main.UpdateIndex: shards = null
11/12/29 10:05:20 INFO main.UpdateIndex: indexPath = index_030
11/12/29 10:05:20 INFO main.UpdateIndex: numShards = 1
11/12/29 10:05:20 INFO main.UpdateIndex: numMapTasks= 1
11/12/29 10:05:20 INFO main.UpdateIndex: confPath = conf/index-config.xml
11/12/29 10:05:21 INFO main.UpdateIndex: sea.index.updater = org.apache.hadoop.c
ontrib.index.mapred.IndexUpdater
11/12/29 10:05:21 INFO mapred.IndexUpdater: mapred.input.dir = hdfs://localhost:
18888/user/kelo-dichan/administrator/input/input
11/12/29 10:05:21 INFO mapred.IndexUpdater: mapred.output.dir = hdfs://localhost
:18888/user/kelo-dichan/administrator/index_msg_out_010
11/12/29 10:05:21 INFO mapred.IndexUpdater: mapred.map.tasks = 1
11/12/29 10:05:21 INFO mapred.IndexUpdater: mapred.reduce.tasks = 1
11/12/29 10:05:21 INFO mapred.IndexUpdater: 1 shards = -1@index_030/00000@-1
11/12/29 10:05:21 INFO mapred.IndexUpdater: mapred.input.format.class = org.apac
he.hadoop.contrib.index.example.LineDocInputFormat
11/12/29 10:05:21 WARN mapred.JobClient: Use GenericOptionsParser for parsing th
e arguments. Applications should implement Tool for the same.
11/12/29 10:05:21 INFO mapred.FileInputFormat: Total input paths to process : 3
11/12/29 10:05:23 INFO mapred.JobClient: Running job: job_201112281720_0005
11/12/29 10:05:24 INFO mapred.JobClient: map 0% reduce 0%
运行成功
$ bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths input/input -outputPath
index_msg_out_012 -indexPath index_032 -numShards 1 -numMapTasks 1 -conf conf/i
dex-config.xml
11/12/29 10:17:12 INFO main.UpdateIndex: inputPaths = input/input
11/12/29 10:17:12 INFO main.UpdateIndex: outputPath = index_msg_out_012
11/12/29 10:17:12 INFO main.UpdateIndex: shards = null
11/12/29 10:17:12 INFO main.UpdateIndex: indexPath = index_032
11/12/29 10:17:12 INFO main.UpdateIndex: numShards = 1
11/12/29 10:17:12 INFO main.UpdateIndex: numMapTasks= 1
11/12/29 10:17:12 INFO main.UpdateIndex: confPath = conf/index-config.xml
11/12/29 10:17:13 INFO main.UpdateIndex: sea.index.updater = org.apache.hadoop
ontrib.index.mapred.IndexUpdater
11/12/29 10:17:13 INFO mapred.IndexUpdater: mapred.input.dir = hdfs://localhos
18888/user/kelo-dichan/administrator/input/input
11/12/29 10:17:13 INFO mapred.IndexUpdater: mapred.output.dir = hdfs://localho
:18888/user/kelo-dichan/administrator/index_msg_out_012
11/12/29 10:17:13 INFO mapred.IndexUpdater: mapred.map.tasks = 1
11/12/29 10:17:13 INFO mapred.IndexUpdater: mapred.reduce.tasks = 1
11/12/29 10:17:13 INFO mapred.IndexUpdater: 1 shards = -1@index_032/00000@-1
11/12/29 10:17:13 INFO mapred.IndexUpdater: mapred.input.format.class = org.ap
he.hadoop.contrib.index.example.LineDocInputFormat
11/12/29 10:17:13 WARN mapred.JobClient: Use GenericOptionsParser for parsing
e arguments. Applications should implement Tool for the same.
11/12/29 10:17:13 INFO mapred.FileInputFormat: Total input paths to process :
11/12/29 10:17:13 INFO mapred.JobClient: Running job: job_201112291014_0002
11/12/29 10:17:14 INFO mapred.JobClient: map 0% reduce 0%
11/12/29 10:17:22 INFO mapred.JobClient: map 66% reduce 0%
11/12/29 10:17:25 INFO mapred.JobClient: map 100% reduce 0%
11/12/29 10:17:32 INFO mapred.JobClient: map 100% reduce 22%
11/12/29 10:17:38 INFO mapred.JobClient: map 100% reduce 100%
11/12/29 10:17:40 INFO mapred.JobClient: Job complete: job_201112291014_0002
11/12/29 10:17:40 INFO mapred.JobClient: Counters: 18
11/12/29 10:17:40 INFO mapred.JobClient: Job Counters
11/12/29 10:17:40 INFO mapred.JobClient: Launched reduce tasks=1
11/12/29 10:17:40 INFO mapred.JobClient: Launched map tasks=3
11/12/29 10:17:40 INFO mapred.JobClient: Data-local map tasks=3
11/12/29 10:17:40 INFO mapred.JobClient: FileSystemCounters
11/12/29 10:17:40 INFO mapred.JobClient: FILE_BYTES_READ=2104
11/12/29 10:17:40 INFO mapred.JobClient: HDFS_BYTES_READ=161
11/12/29 10:17:40 INFO mapred.JobClient: FILE_BYTES_WRITTEN=2910
11/12/29 10:17:40 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=622
11/12/29 10:17:40 INFO mapred.JobClient: Map-Reduce Framework
11/12/29 10:17:40 INFO mapred.JobClient: Reduce input groups=1
11/12/29 10:17:40 INFO mapred.JobClient: Combine output records=3
11/12/29 10:17:40 INFO mapred.JobClient: Map input records=3
11/12/29 10:17:40 INFO mapred.JobClient: Reduce shuffle bytes=892
11/12/29 10:17:40 INFO mapred.JobClient: Reduce output records=1
11/12/29 10:17:40 INFO mapred.JobClient: Spilled Records=6
11/12/29 10:17:40 INFO mapred.JobClient: Map output bytes=1302
11/12/29 10:17:40 INFO mapred.JobClient: Map input bytes=161
11/12/29 10:17:40 INFO mapred.JobClient: Combine input records=3
11/12/29 10:17:40 INFO mapred.JobClient: Map output records=3
11/12/29 10:17:40 INFO mapred.JobClient: Reduce input records=3
11/12/29 10:17:40 INFO main.UpdateIndex: Index update job is done
11/12/29 10:17:40 INFO main.UpdateIndex: Elapsed time is 27s
Elapsed time is 27s
Administrator@kelo-dichan /cygdrive/d/hadoop/run
$ bin/hadoop fs -copyToLocal /user/kelo-dichan/administrator/*.* D:\tmp\testou
tput
Administrator@kelo-dichan /cygdrive/d/hadoop/run
$ bin/hadoop fs -copyToLocal /user/kelo-dichan/administrator/*.* /cygdrive/D/tmp/testoutput
Administrator@kelo-dichan /cygdrive/d/hadoop/run
$
//取出hdfs文件系统中的目录下的所有文件
$ bin/hadoop fs -get /user/kelo-dichan/administrator/index_032/00000/ D:/tmp/testoutput
$ bin/hadoop fs -get /user/kelo-dichan/administrator/index_035/00000/ D:/tmp/testoutput
分布式索引简单总结
1、命令如下
$ bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths input/input -outputPath
index_msg_out_012 -indexPath index_032 -numShards 1 -numMapTasks 1 -conf conf/idex-config.xml
参数说明
路径指的是hdfs分布式系统中的路径
-numShards
-numMapTasks
这两个数值不一样,如都为3时(输入路径中有3个文件),则索引结果少了数据(文档数据少了),暂不知道原因
如果改成组合文件,为是什么样呢
$ bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths input/input -outputPat
ndex_msg_out_020 -indexPath index_040 -numShards 3 -numMapTasks 3 -conf conf
dex-config.xml
11/12/29 11:39:19 INFO main.UpdateIndex: inputPaths = input/input
11/12/29 11:39:19 INFO main.UpdateIndex: outputPath = index_msg_out_020
11/12/29 11:39:19 INFO main.UpdateIndex: shards = null
11/12/29 11:39:19 INFO main.UpdateIndex: indexPath = index_040
11/12/29 11:39:19 INFO main.UpdateIndex: numShards = 3
11/12/29 11:39:19 INFO main.UpdateIndex: numMapTasks= 3
11/12/29 11:39:19 INFO main.UpdateIndex: confPath = conf/index-config.xml
11/12/29 11:39:20 INFO main.UpdateIndex: sea.index.updater = org.apache.hado
ontrib.index.mapred.IndexUpdater
11/12/29 11:39:20 INFO mapred.IndexUpdater: mapred.input.dir = hdfs://localh
18888/user/kelo-dichan/administrator/input/input
11/12/29 11:39:20 INFO mapred.IndexUpdater: mapred.output.dir = hdfs://local
:18888/user/kelo-dichan/administrator/index_msg_out_020
11/12/29 11:39:20 INFO mapred.IndexUpdater: mapred.map.tasks = 3
11/12/29 11:39:20 INFO mapred.IndexUpdater: mapred.reduce.tasks = 3
11/12/29 11:39:20 INFO mapred.IndexUpdater: 3 shards = -1@index_040/00000@-1
index_040/00001@-1,-1@index_040/00002@-1
11/12/29 11:39:20 INFO mapred.IndexUpdater: mapred.input.format.class = org.
he.hadoop.contrib.index.example.LineDocInputFormat
11/12/29 11:39:20 WARN mapred.JobClient: Use GenericOptionsParser for parsin
e arguments. Applications should implement Tool for the same.
11/12/29 11:39:20 INFO mapred.FileInputFormat: Total input paths to process
11/12/29 11:39:20 INFO mapred.JobClient: Running job: job_201112291106_0009
11/12/29 11:39:21 INFO mapred.JobClient: map 0% reduce 0%
11/12/29 11:39:30 INFO mapred.JobClient: map 33% reduce 0%
11/12/29 11:39:34 INFO mapred.JobClient: map 100% reduce 0%
11/12/29 11:39:40 INFO mapred.JobClient: map 100% reduce 7%
11/12/29 11:39:43 INFO mapred.JobClient: map 100% reduce 14%
11/12/29 11:39:46 INFO mapred.JobClient: map 100% reduce 40%
11/12/29 11:39:49 INFO mapred.JobClient: map 100% reduce 66%
11/12/29 11:39:52 INFO mapred.JobClient: map 100% reduce 100%
11/12/29 11:39:54 INFO mapred.JobClient: Job complete: job_201112291106_0009
11/12/29 11:39:54 INFO mapred.JobClient: Counters: 18
11/12/29 11:39:54 INFO mapred.JobClient: Job Counters
11/12/29 11:39:54 INFO mapred.JobClient: Launched reduce tasks=3
11/12/29 11:39:54 INFO mapred.JobClient: Launched map tasks=3
11/12/29 11:39:54 INFO mapred.JobClient: Data-local map tasks=3
11/12/29 11:39:54 INFO mapred.JobClient: FileSystemCounters
11/12/29 11:39:54 INFO mapred.JobClient: FILE_BYTES_READ=2648
11/12/29 11:39:54 INFO mapred.JobClient: HDFS_BYTES_READ=161
11/12/29 11:39:54 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3279
11/12/29 11:39:54 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1025
11/12/29 11:39:54 INFO mapred.JobClient: Map-Reduce Framework
11/12/29 11:39:54 INFO mapred.JobClient: Reduce input groups=2
11/12/29 11:39:54 INFO mapred.JobClient: Combine output records=3
11/12/29 11:39:54 INFO mapred.JobClient: Map input records=3
11/12/29 11:39:54 INFO mapred.JobClient: Reduce shuffle bytes=948
11/12/29 11:39:54 INFO mapred.JobClient: Reduce output records=2
11/12/29 11:39:54 INFO mapred.JobClient: Spilled Records=6
11/12/29 11:39:54 INFO mapred.JobClient: Map output bytes=1350
11/12/29 11:39:54 INFO mapred.JobClient: Map input bytes=161
11/12/29 11:39:54 INFO mapred.JobClient: Combine input records=3
11/12/29 11:39:54 INFO mapred.JobClient: Map output records=3
11/12/29 11:39:54 INFO mapred.JobClient: Reduce input records=3
11/12/29 11:39:54 INFO main.UpdateIndex: Index update job is done
11/12/29 11:39:54 INFO main.UpdateIndex: Elapsed time is 33s
Elapsed time is 33s
文件系统是hadoop 分布式文件系统
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8888</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem. file:/// hdfs://localhost:8888</description>
</property>
文件系统是本地,这种方式也很好 core_site.xml
<property>
<name>fs.default.name</name>
<value>file:///</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem. file:/// hdfs://localhost:8888</description>
</property>
以下是本地文件系统使用例字 testdata/input 本地的相对目录(全路径是D:/hadoop/run/testdata/input D:/hadoop/run是安装路径)
jar hadoop-0.20.1-examples.jar wordcount testdata/input output-dir1
bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths testdata/input -outputPath index_msg_out_012 -indexPath index_032 -numShards 1 -numMapTasks 1 -conf conf/idex-config.xml
hadoop-0.20.2
./bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input/input output-di
/cygdrive/d/tmp/testdata/input
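Using MapReduce in Eclipse
Required environment:
eclipse 3.3
hadoop-0.20.2-eclipse-plugin.jar from hadoop 0.20.2
Start the debug script (after running the script below, debug the same program in Eclipse using a remote debug configuration; the settings are configurable):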
./bin/hddebug jar hadoop-0.20.2-examples.jar wordcount input/input output-di
Listening for transport dt_socket at address: 28888
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://127.0.0.1:8888/user/kelo-dichan/administrator/input/input
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
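The path in the exception is longer than the input/input argument given on the command line. Hadoop resolves a relative path against the working directory on the default file system, which on HDFS defaults to the user's home directory /user/<username>. The Python sketch below mimics that resolution to show where the reported path comes from. qualify_path is a hypothetical helper, not a Hadoop API, and reading the user name as kelo-dichan/administrator (presumably derived from a Windows domain\user account) is an assumption based on the path in the trace.

```python
import posixpath


def qualify_path(default_fs, user, path):
    """Mimic how a job input path is turned into a full file system URI.

    Hypothetical helper for illustration; not a Hadoop API.
    """
    if "://" in path:
        # Already a full URI: nothing to resolve.
        return path
    if not path.startswith("/"):
        # Relative paths are resolved against the HDFS home directory.
        path = posixpath.join("/user", user, path)
    return default_fs.rstrip("/") + path


# User name as it appears in the trace (assumed to come from domain\user).
resolved = qualify_path("hdfs://127.0.0.1:8888", "kelo-dichan/administrator", "input/input")
print(resolved)
```

So the job looks for its input under that directory in HDFS; listing it (for example with bin/hadoop fs -ls) is a quick way to check whether the data was actually uploaded there.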
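Detailed steps
For reference:
1. First configure Hadoop on Windows 7 so that it runs normally.
2. Copy the bin/hadoop script and rename the copy to hddebug.
3. Add one line to hddebug, at the line if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then, as follows:
HADOOP_OPTS="$HADOOP_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,address=28888,server=y,suspend=y"
4. Run:
./bin/hddebug jar hadoop-0.20.2-examples.jar wordcount input/input output-di
You should see: Listening for transport dt_socket at address: 28888
5. Start Eclipse and debug the wordcount code.
From the Debug menu, set up a remote debug configuration; debugging can then proceed.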
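The line added in step 3 passes a JDWP agent option to the JVM. As a reading aid, the Python sketch below splits that option into its key/value pairs; parse_jdwp is a hypothetical helper, not part of Hadoop or the JDK. In this option, server=y makes the JVM listen for a debugger on the given address, and suspend=y makes it wait until a debugger attaches before running the program, which is why the command in step 4 pauses at the Listening message.

```python
def parse_jdwp(option):
    """Split a -Xrunjdwp:... JVM option into a dict of its settings.

    Hypothetical helper for illustration; not part of Hadoop or the JDK.
    """
    prefix = "-Xrunjdwp:"
    if not option.startswith(prefix):
        raise ValueError("not a -Xrunjdwp option: %r" % option)
    # The settings are comma-separated key=value pairs.
    pairs = option[len(prefix):].split(",")
    return dict(pair.split("=", 1) for pair in pairs)


settings = parse_jdwp("-Xrunjdwp:transport=dt_socket,address=28888,server=y,suspend=y")
print(settings)
```

The address value is the port that the Eclipse remote debug configuration has to connect to.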