sillycat

Perl Huge XML Solution(1)Split Files and Multiple Threads

 
1. Upgrade Perl
>sudo yum install cpan

>sudo cpan
cpan>install Bundle::CPAN
cpan>reload cpan

cpan>upgrade
This did not work and failed with the error message:
make NO isa perl

Solution:
> sudo yum install perl-Config*

That still did not let me upgrade Perl itself, but I could install the modules I needed one by one:
cpan> install Time::Piece
cpan> install Path::Class
cpan> install autodie
cpan> install Thread::Queue
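
Before running the scripts below, it may help to confirm that the modules really load. This is only a minimal sketch; the helper name missing_modules is my own for illustration and not part of CPAN or any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the names of the modules that cannot be loaded.
# (missing_modules is a hypothetical helper name.)
sub missing_modules {
    my @names = @_;
    my @missing;
    for my $name (@names) {
        # Turn Foo::Bar into Foo/Bar.pm so require() gets a file path
        (my $file = "$name.pm") =~ s{::}{/}g;
        push @missing, $name unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules(qw(Time::Piece Path::Class autodie Thread::Queue));
print @missing ? "missing: @missing\n" : "all modules found\n";
```

Any module reported as missing can then be installed from the cpan shell as shown above.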

2. Split the File
split_hero.pl
#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;
use Time::Piece;
use Path::Class;
use autodie; # die if problem reading or writing a file

my $OutputSize = 0;
my $OutputCount = 0;
my $MaxSize = 100_000_000;
my $HugeFileName = "data/728";

print localtime->strftime('%Y-%m-%d %X') . "\n";

my $out;
open(my $in, '<', $HugeFileName . '.xml') or die "input: $!\n";
while(<$in>) {
    if(!$out) {
        $OutputCount++;
        $OutputSize = 0;
        open($out, '>', $HugeFileName . "/output$OutputCount.xml") or die "output: $!\n";
        unless($OutputCount==1) {
            print $out qq{<?xml version='1.0' encoding='UTF-8'?>\n};
            print $out qq{<source>\n};
        }
    }
    print $out $_;
    $OutputSize += length($_);
    if(m|</job>|i) { #/
        if($OutputSize > $MaxSize) {
            print $out "</source>\n";
            close($out);
            $out = undef;
        }
    }
}
# Close the last chunk, which is still open when the input ends.
close($out) if $out;
close($in);

my @files = glob($HugeFileName . "/*.xml");

my $dir = dir($HugeFileName);
my $list_file = $dir->file("file_list");
my $list_file_handle = $list_file->open('>>');

foreach my $file (@files) {
   $list_file_handle->print($file . "\n");
   print "$file\n";
}

print localtime->strftime('%Y-%m-%d %X') . "\n";
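
The cutting rule in the script (only start a new file after a closing </job> tag, and only once the current file has passed the size limit) can be tried out on in-memory data. The sketch below is a simplified illustration, not the script itself: the function name chunk_lines and the sample data are made up, and it leaves out the XML header and the <source> wrapper that the real script writes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group lines into chunks, cutting only after a line that closes a <job>
# element and only once the chunk has grown past $max_size bytes.
# (chunk_lines is a hypothetical name used for illustration.)
sub chunk_lines {
    my ($lines, $max_size) = @_;
    my @chunks;
    my $current = '';
    for my $line (@$lines) {
        $current .= $line;
        if ($line =~ m{</job>}i && length($current) > $max_size) {
            push @chunks, $current;
            $current = '';
        }
    }
    # Keep whatever is left after the last cut
    push @chunks, $current if length($current);
    return @chunks;
}

# Made-up sample data: five small one-line <job> records
my @lines  = map { "<job><id>$_</id></job>\n" } 1 .. 5;
my @chunks = chunk_lines(\@lines, 40);
for my $i (0 .. $#chunks) {
    printf "chunk %d: %d bytes\n", $i + 1, length($chunks[$i]);
}
```

Because a cut only happens on a </job> line, a record is never split across two chunks, so each chunk can overshoot the limit by at most one record.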

3. Multiple Threads on Perl
#!/usr/bin/perl

use strict;
use warnings;

use threads;
use Thread::Queue;

my $nthreads = 5;

my $process_q = Thread::Queue->new();
my $failed_q  = Thread::Queue->new();

#this is a subroutine that runs 'as a thread'.
#when it starts, it inherits the program state 'as is'. E.g.
#the variable declarations above all apply - but changes to
#values within the program are 'thread local' unless the
#variable is declared as 'shared'.
#Behind the scenes - Thread::Queue objects are 'shared' arrays.

sub worker {

    #NB - this will sit in a loop indefinitely until you close the queue
    #using $process_q->end().
    #we do this once we've queued all the things we want to process,
    #and the sub then completes and exits neatly.
    #however, if you _don't_ end it, this will sit waiting forever.
    #'defined' keeps the loop alive for false-looking items such as "0".
    while ( defined( my $server = $process_q->dequeue() ) ) {
        chomp($server);
        print threads->self()->tid() . ": pinging $server\n";
        my $result = `/sbin/ping -c 1 $server`;
        if ($?) { $failed_q->enqueue($server) }
        print $result;
    }
}

#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
print("reading the task list from server_list\n");
$process_q->enqueue(<$input_fh>);
close($input_fh);

#we 'end' process_q  - when we do, no more items may be inserted,
#and 'dequeue' returns 'undefined' when the queue is emptied.
#this means our worker threads (in their 'while' loop) will then exit.
$process_q->end();

#start some threads
for ( 1 .. $nthreads ) {
    threads->create( \&worker );
}

#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
    $thr->join();
}

#collate results. ('synchronise' operation)
while ( my $server = $failed_q->dequeue_nb() ) {
    print "$server failed to ping\n";
}

I changed the worker a little bit so that it calls a PHP script instead of ping. Here $server holds one line from the task list:
my $result = `php src/import.php 728 $server`;
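
One thing to keep in mind is that backticks hand the whole string to the shell, so a file name with spaces or shell characters could break the command. A possible alternative is the list form of system, sketched below. The helper name run_command is hypothetical, and unlike backticks it does not capture the output; the command's output goes straight to the terminal.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run a command without going through the shell and return its exit code.
# Returns -1 if the command could not be started, and 128 + signal number
# (the usual shell convention) if it was killed by a signal.
# (run_command is a hypothetical helper name.)
sub run_command {
    my @cmd = @_;
    my $status = system(@cmd);
    return -1 if $status == -1;
    return 128 + ($status & 127) if $status & 127;
    return $status >> 8;
}

# Hypothetical usage mirroring the call above:
#   my $exit = run_command('php', 'src/import.php', '728', $server);
# Here the running Perl interpreter ($^X) stands in for php so that the
# sketch runs on a machine without PHP.
my $exit = run_command($^X, '-e', 'exit 3');
print "exit code: $exit\n";
```

In the worker, a non-zero return value could be used the same way as $? in the ping example, to push the item onto $failed_q.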

4. Test Result
Splitting the huge XML file (4.5 GB) on a machine with 2 CPU cores and 4 GB of memory took 00:02:05:
start 04:17:24
end   04:19:29

Sending the data to Redis/SQS on a machine with 2 CPU cores and 4 GB of memory took 00:03:12:
start 04:23:46
end   04:26:58
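
The durations above are simply the difference between the two printed timestamps. A small sketch of that arithmetic with Time::Piece follows; the function name elapsed_hms is my own, and it assumes both timestamps fall on the same day.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Piece;

# Return the time between two HH:MM:SS timestamps as HH:MM:SS.
# Assumes both timestamps are on the same day (no midnight rollover).
# (elapsed_hms is a hypothetical helper name.)
sub elapsed_hms {
    my ($start, $end) = @_;
    my $diff = Time::Piece->strptime($end,   '%H:%M:%S')
             - Time::Piece->strptime($start, '%H:%M:%S');
    my $seconds = $diff->seconds;    # Time::Seconds object -> plain number
    return sprintf '%02d:%02d:%02d',
        int($seconds / 3600), int(($seconds % 3600) / 60), $seconds % 60;
}

print "split: ", elapsed_hms('04:17:24', '04:19:29'), "\n";
print "send:  ", elapsed_hms('04:23:46', '04:26:58'), "\n";
```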


References:
http://sillycat.iteye.com/blog/1017590  file handler
http://sillycat.iteye.com/blog/2193773

Perl 1, 2, 3, 4, 6
http://sillycat.iteye.com/blog/1012882
http://sillycat.iteye.com/blog/1012923
http://sillycat.iteye.com/blog/1012940
http://sillycat.iteye.com/blog/1016428
http://sillycat.iteye.com/blog/1017632 string
http://sillycat.iteye.com/blog/1021197 web
http://sillycat.iteye.com/blog/1027282 queue client
http://sillycat.iteye.com/blog/1073593 browser info

Split XML File
http://stackoverflow.com/questions/11313852/split-one-file-into-multiple-files-based-on-delimiter
http://stackoverflow.com/questions/15503980/split-file-by-xml-tag
http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_24760607.html
https://metacpan.org/pod/XML::Twig#xml_split---cut-a-big-XML-file-into-smaller-chunks
http://code.izzid.com/2008/01/21/How-to-move-back-a-line-with-reading-a-perl-filehandle.html

Perl threads
http://stackoverflow.com/questions/26296206/perl-daemonize-with-child-daemons/26297240#26297240
http://stackoverflow.com/questions/6556976/how-to-use-perl-to-run-the-same-php-script-parallel

Perl Zip the File
http://perldoc.perl.org/IO/Compress/Zip.html