roger51

PDF, Word, and XLS parser efficiency


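Hello everyone. I have been on JavaEye for a long time and have learned a great deal from the experience people share on various sites, for which I am grateful. I have now run into a problem with Lucene indexing. I searched Chinese and foreign sites for a long time without finding a satisfactory solution, so I am asking here. I hope those with experience in this area can help, ideally with some good code or workable suggestions. My code is roughly as follows: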

import com.messagesolution.message.viewer.util.HtmlDocument;
import com.messagesolution.util.logger.Logger;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.util.PDFTextStripper;
import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.textmining.text.extraction.WordExtractor;

import java.io.*;


public class DocumentConverter
{

 public static boolean convertPDF(String fromfile, String tofile)
    {
        PDFParser parser = null;
        String s = null;
        FileInputStream in = null;
        FileOutputStream fos = null;
        //BufferedOutputStream bos = null;
        DataOutputStream dos = null;
        try
        {
            try {
    PDFTextStripper _stripper = new PDFTextStripper();
    in = new FileInputStream(new File(fromfile));
    parser = new PDFParser(in);
    parser.parse();
    s = _stripper.getText(parser.getDocument());
    if (s == null || s.length() == 0){   // StringToolKit is not imported in this file, so check inline
     Logger.getInstance().error("read string of pdf is empty");
     
     return false;       //nothing to write
    }
       
   } catch (Exception e) {
    Logger.getInstance().error("read pdf or convert it error");
    e.printStackTrace();
    return false;
   }

            try {
    //now write this string to a file
    fos = new FileOutputStream(new File(tofile));
    //bos = new BufferedOutputStream(fos);
    //bos.write(s.getBytes());  //what about other language?
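    dos = new DataOutputStream(fos);
    // writeBytes(s) keeps only the low byte of each char and corrupts non-ASCII text;
    // write UTF-8 bytes instead (assumes the indexer reads the file as UTF-8)
    dos.write(s.getBytes("UTF-8"));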
   } catch (Exception e) {
    Logger.getInstance().error("write converted txt error");
    e.printStackTrace();
    return false;
   }
        }
        catch (Throwable t)
        {
            if (t instanceof OutOfMemoryError)
                Logger.getInstance().fatal("OutOfMemoryError occurred in convertPDF for file: " + fromfile, t);
            System.err.println("Exception occurred in convertPDF, t: " + t);
            t.printStackTrace();
            return false;   //something wrong during the conversion
        }
        finally {
            try
            {
                if (parser != null)
                    parser.getDocument().close();
                if (in != null)
                    in.close();
                if (fos != null)
                    fos.close();
                //if (bos != null)
                //    bos.close();
                if (dos != null)
                    dos.close();
            } catch (Exception ex) {
              Logger.getInstance().error(ex.toString());
            }
        }

  return true;
 }

 public static boolean convertDOC(String fromfile, String tofile)
    {
        FileInputStream fis = null;
        FileOutputStream fos = null;
        DataOutputStream dos = null;

        try
        {
            fis = new FileInputStream (new File(fromfile));
            WordExtractor extractor = new WordExtractor();
            String s = extractor.extractText(fis);

            //now write this string to a file
            fos = new FileOutputStream(new File(tofile));
            //bos = new BufferedOutputStream(fos);
            //bos.write(s.getBytes());  //what about other language?
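            dos = new DataOutputStream(fos);
            // writeBytes(s) keeps only the low byte of each char and corrupts non-ASCII text;
            // write UTF-8 bytes instead (assumes the indexer reads the file as UTF-8)
            dos.write(s.getBytes("UTF-8"));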
        }
        catch (Throwable t)
        {
            if (t instanceof OutOfMemoryError)
                Logger.getInstance().fatal("OutOfMemoryError occurred in convertDOC for file: " + fromfile, t);
            System.err.println("Exception occurred in convertDOC, t: " + t);
            t.printStackTrace();
            return false;   //something wrong during the conversion
        }
        finally
        {
            try
            {
                if (fis != null)
                    fis.close();
                if (fos != null)
                    fos.close();
                if (dos != null)
                    dos.close();
            } catch (Exception e) {}
        }

  return true;
 }

 public static boolean convertHTML(String fromfile, String tofile)
    {
        try
        {
            String htmlCharset = HtmlDocument.convertHtml(fromfile, tofile);
            System.out.println("htmlCharset: " + htmlCharset);
        }
        catch (Throwable t)
        {
            if (t instanceof OutOfMemoryError)
                Logger.getInstance().fatal("OutOfMemoryError occurred in convertHTML for file: " + fromfile, t);
            System.err.println("Exception occurred in convertHTML, t: " + t);
            t.printStackTrace();
            return false;   //something wrong during the conversion
        }

  return true;
 }

 public static boolean convertPPT(String fromfile, String tofile)
    {
        System.err.println("convertPPT not supported yet!");
        Thread.dumpStack();
        return false;
 // return false;
 }

 public static boolean convertXLS(String fromfile, String tofile)
    {
        StringBuffer sb = new StringBuffer();
        FileInputStream fis = null;
        FileOutputStream fos = null;
        DataOutputStream dos = null;
        HSSFWorkbook wb = null;

        try
        {
            fis = new FileInputStream(new File(fromfile));
            wb = new HSSFWorkbook(fis);

            int numSheets = wb.getNumberOfSheets();
            for (int i=0;i<numSheets;++i)
            {
                HSSFSheet sheet = wb.getSheetAt(i);
                int numRows = sheet.getLastRowNum();   // 0-based index of the last row
                for (int j=0;j<=numRows;++j)           // <= so the last row is not skipped
                {
                    HSSFRow row = sheet.getRow(j);
                    if (row == null)
                        continue;
                   
                    int numCells = row.getLastCellNum();
                    for (int k=0;k<numCells;++k)
                    {
                        HSSFCell cell = row.getCell((short)k);
                        if(cell!=null)
                        {
                            int type = cell.getCellType();
                            if(type==HSSFCell.CELL_TYPE_STRING)
                            {
                                String str = cell.getStringCellValue();
                                str=str.trim();
                                str=replace(str,"\n","");
                                sb.append(str).append(" ");
                            }
                        }
                        // We will ignore all other types - numbers, formulas, etc.
                        // as these don't hold a lot of meaning outside of their tabular context.
                        // else if(type==, CELL_TYPE_NUMERIC, CELL_TYPE_FORMULA, CELL_TYPE_BOOLEAN, CELL_TYPE_ERROR
                    } // cells
                    //sb.append("\n"); // break on each row
                } // rows
                sb.append("\n"); // break on each sheet
            } // sheets

            String s = sb.toString();
            //now write this string to a file
            fos = new FileOutputStream(new File(tofile));
            //bos = new BufferedOutputStream(fos);
            //bos.write(s.getBytes());  //what about other language?
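            dos = new DataOutputStream(fos);
            // writeBytes(s) keeps only the low byte of each char and corrupts non-ASCII text;
            // write UTF-8 bytes instead (assumes the indexer reads the file as UTF-8)
            dos.write(s.getBytes("UTF-8"));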
        }
        catch (Throwable t)
        {
            if (t instanceof OutOfMemoryError)
                Logger.getInstance().fatal("OutOfMemoryError occurred in convertXLS for file: " + fromfile, t);
            System.err.println("Exception occurred in convertXLS, t: " + t);
            t.printStackTrace();
            return false;   //something wrong during the conversion
        }
        finally
        {
            try
            {
                if (fis != null)
                    fis.close();
                if (fos != null)
                    fos.close();
                if (dos != null)
                    dos.close();
            } catch (Exception e) {}
        }

  return true;
 }


    // This should really be moved into a utility class;
    // included here to simplify things
    private final static String replace(String line, String oldString, String newString)
    {
        if (line == null) {
            return null;
        }
        int i = 0;
        if ((i = line.indexOf(oldString, i)) >= 0) {
            char[] line2 = line.toCharArray(); char[] newString2 = newString.toCharArray(); int oLength = oldString.length();
            StringBuffer buf = new StringBuffer(line2.length); buf.append(line2, 0, i).append(newString2); i += oLength;
            int j = i;
            while ((i = line.indexOf(oldString, i)) > 0) {
                buf.append(line2, j, i - j).append(newString2); i += oLength; j = i;
            }
            buf.append(line2, j, line2.length - j); return buf.toString();
        }
        return line;
    }

    public static void main(String[] args)
    {
        int index = 0;
        String action = args[index++];
        String f1 = args[index++];
        String f2 = args[index++];

        long start = System.currentTimeMillis();
        long end = 0;
        if (action.equals("pdf"))
            convertPDF(f1, f2);
        else if (action.equals("doc"))
            convertDOC(f1, f2);
        else if (action.equals("xls"))
            convertXLS(f1, f2);
        else if (action.equals("ppt"))
            convertPPT(f1, f2);
        else if (action.equals("html"))   // was a second "ppt" test, which made convertHTML unreachable
            convertHTML(f1, f2);

        end = System.currentTimeMillis();
        System.out.println(action + " convert " + f1 + " took " + ((end-start)/1000) + " seconds.");
    }

}

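The main method takes three arguments: the first is the format of the document to convert, the second is the path of the source document, and the third is where the output document should be written.

The output documents are then indexed. Each document averages between 1 MB and 5 MB.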
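
Since the converted text files are what ends up in the index, how the text is written matters. The "what about other language?" comment in the code points at a real issue: `DataOutputStream.writeBytes` keeps only the low eight bits of each `char`, so anything outside Latin-1 is lost. Below is a minimal sketch of the difference. It assumes a Java 7+ runtime for `StandardCharsets` (on the JDK 1.4 mentioned in this post, `s.getBytes("UTF-8")` is the equivalent), and the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    /** Writes the string the way the converter does: one byte per char, high byte dropped. */
    static byte[] viaWriteBytes(String s) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(out);
            dos.writeBytes(s);
            dos.flush();
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Encodes the string explicitly as UTF-8, which preserves every character. */
    static byte[] viaUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u7d22\u5f15 index"; // two Chinese characters followed by ASCII
        String lossy = new String(viaWriteBytes(text), StandardCharsets.UTF_8);
        String intact = new String(viaUtf8(text), StandardCharsets.UTF_8);
        System.out.println("writeBytes round trip intact: " + text.equals(lossy));
        System.out.println("UTF-8 round trip intact:      " + text.equals(intact));
    }
}
```

Whatever encoding is chosen for writing, the code that later reads the text files for indexing has to use the same one.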

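Problem: converting PDF, Word, and XLS documents is very slow. I originally planned to write a thread pool for the conversion, but my test data showed that multithreaded conversion was actually slower than a single thread, and it was also prone to OutOfMemoryError. I then tried another idea: splitting large PDF, Word, and XLS files into smaller ones. But the Java splitter I wrote only works on txt files. Word and PDF files contain a lot of binary formatting and style data, so the pieces cannot be merged back into one large document (C++ or .NET, on the other hand, do have ways to split them). So I hope someone who has indexed large volumes of PDF, Word, and XLS documents can offer some help with processing them quickly. The current data volume is about 1 TB (roughly 100 GB). The server has about 4 CPUs and 4 GB of memory, and the JVM heap is set to 1.2 GB on JDK 1.4 and cannot be raised any further. Thanks for your help.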
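
One possible reason a thread pool runs out of memory here is that every in-flight conversion holds a whole parsed document plus its extracted text on the heap, so peak memory grows with the number of worker threads. The following is a hedged sketch of capping that number with a small fixed-size pool. It assumes Java 8+ (`java.util.concurrent` does not exist on JDK 1.4), and `Converter`, `convertAll`, and the `.txt` output naming are hypothetical stand-ins for the `convertPDF`/`convertDOC`/`convertXLS` methods, not code from this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BoundedConversion {
    /** Hypothetical stand-in for convertPDF / convertDOC / convertXLS. */
    interface Converter {
        boolean convert(String fromFile, String toFile);
    }

    /**
     * Converts every file with at most maxWorkers conversions running at once
     * and returns how many reported success.
     */
    static int convertAll(List<String> files, Converter converter, int maxWorkers) {
        ExecutorService pool = Executors.newFixedThreadPool(maxWorkers);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (String file : files) {
                // Only file names are queued; documents are loaded inside the worker.
                results.add(pool.submit(() -> converter.convert(file, file + ".txt")));
            }
            int succeeded = 0;
            for (Future<Boolean> result : results) {
                try {
                    if (result.get()) {
                        succeeded++;
                    }
                } catch (ExecutionException e) {
                    // The converter threw; treat this file as failed and keep going.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return succeeded;
        } finally {
            pool.shutdown();
        }
    }
}
```

With a 1.2 GB heap, a pool of two or three workers may be a more realistic starting point than one thread per file. Whether it beats a single thread depends on whether the conversion is limited by CPU or by disk, which is worth measuring first.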

Comments
#6 nolan022 2008-07-13
roger51 wrote: [original post quoted in full]
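#5 roger51 2007-07-31
My problem is solved, using the approach from my blog post on I/O efficiency. A Word document (given as "10word" in the original, unit unclear) used to take about 16 seconds; now two seconds is enough.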
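
The blog post on I/O efficiency is not part of this thread, so the actual change is not known. One common I/O fix, shown here purely as an assumption about what might help, is wrapping the raw `FileInputStream` in a `BufferedInputStream` before handing it to a parser. On an unbuffered file stream, each single-byte `read()` goes to the operating system. The sketch below counts how many reads reach the underlying stream in both cases; all names in it are made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class BufferingDemo {
    /** Counts how many read calls actually reach the wrapped stream. */
    static class CountingInputStream extends FilterInputStream {
        int calls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Consumes the stream one byte at a time, as a naive parser might. */
    static long drainByteByByte(InputStream in) {
        try {
            long count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the number of underlying reads needed to consume size bytes. */
    static int underlyingReads(int size, boolean buffered) {
        CountingInputStream counting =
                new CountingInputStream(new ByteArrayInputStream(new byte[size]));
        InputStream in = buffered ? new BufferedInputStream(counting, 8192) : counting;
        drainByteByByte(in);
        return counting.calls;
    }

    public static void main(String[] args) {
        System.out.println("unbuffered reads: " + underlyingReads(100000, false));
        System.out.println("buffered reads:   " + underlyingReads(100000, true));
    }
}
```

In the converter this would amount to `new BufferedInputStream(new FileInputStream(fromfile))`. Whether that accounts for the speedup reported above is not something the thread confirms.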
#4 jenkinv 2007-07-30
I have run into the same problem as you. Looking forward to a good solution.
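#3 wl95421 2007-07-24
POI is not fast at processing Excel, so you would do better with another approach.
If you are on Windows and need high speed, consider using Jacob to convert the Excel file to HTML first, and then index that.

PDF and Word can be handled the same way.
PDF in particular is very slow when processed with iText.
I tested PDF output with iText: a 400-page file took about 400 MB of memory, and performance was 2-3 orders of magnitude behind exporting the PDF from WPS.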
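#2 roger51 2007-07-24

Thank you for your help.
Indexing with RAMDirectory and then writing the result to an FSDirectory would certainly perform much better, but I am already doing that with the Lucene IndexWriter, so that should not be the problem. The merge factor is around 2000, so the problem is not there either. My feeling is that the open-source jars are simply too slow at converting PDF, Word, XLS, and similar files. I will try a profiler, but other server programs run well on the same machine, so I don't think memory is much of an issue.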
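#1 wl95421 2007-07-24
First use something like a profiler to get a rough picture of where the resources are going:
whether memory is short and garbage collection is frequent, whether parsing the files is too slow,
or whether merges are happening too often.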
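
Before reaching for a full profiler, a rough first cut is to time each stage separately and see which one dominates. The sketch below is not a replacement for a profiler and says nothing about garbage collection. It assumes Java 8+, and `StageTimer` and its methods are hypothetical names, not part of Lucene, POI, or PDFBox.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class StageTimer {
    // Accumulated nanoseconds per stage, kept in first-seen order.
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    /** Runs the task and adds its wall-clock time to the named stage. */
    void time(String stage, Runnable task) {
        long start = System.nanoTime();
        try {
            task.run();
        } finally {
            nanosByStage.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    /** Read-only view of the totals collected so far. */
    Map<String, Long> totals() {
        return Collections.unmodifiableMap(nanosByStage);
    }

    /** One line per stage, suitable for logging after a batch. */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : nanosByStage.entrySet()) {
            sb.append(e.getKey()).append(": ")
              .append(e.getValue() / 1_000_000).append(" ms\n");
        }
        return sb.toString();
    }
}
```

Wrapping the read, parse, text-extraction, write, and index steps in separate `time` calls over a batch of files gives a per-stage total. If parsing dominates, tuning the Lucene merge settings is unlikely to help much.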

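You could also try reading the files into memory, indexing with RAMDirectory, and then writing the result to an FSDirectory; performance will certainly be much better.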
