
ruby textfile vs binaryfile


The Difference Between Binary and ASCII Files; Converting them

At heart all files are binary files -- that is, a collection of 1s and 0s. But there's a subset of binary files we call ASCII, or plain text files. ASCII is short for American Standard Code for Information Interchange, which allocates a number to each letter, digit and symbol. A plain text file contains no formatting codes whatsoever, no fonts, bold, italics or underlines, headers, footers or graphics. The only 'formatting' that can be applied is to use spaces to pad lines out so that they are centered or right justified, or to add extra blank lines.

Let's look at some examples: different file types containing the word 'hello' followed by a new line. To see the differences, we will use a hex display to show exactly what is in each file. A hex display shows the offset, the hexadecimal value of each byte (its ASCII code, for text) and the actual characters contained in a file (where those characters are printable; it shows a period where they are not). These hex displays were generated by TextPipe Pro (Filters Menu\Convert\Hex dump).

Plain Text File - hello.txt (7 bytes long)

This is the simplest file - the ASCII codes for the letters 'hello' followed by the ASCII codes for a carriage return and line feed.

00000000 68 65 6C 6C 6F 0D 0A                            hello..
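A hex display like the one above can be approximated in a few lines of Ruby. This is only a sketch: `hex_dump` is a hypothetical helper written for this example, not part of TextPipe Pro or of Ruby's standard library, and it assumes a byte counts as printable when it falls in the ASCII range 0x20 to 0x7E.

```ruby
# Format raw bytes as a hex display: offset, hex byte values, printable characters.
# hex_dump is a hypothetical helper for illustration only.
def hex_dump(data, width = 16)
  data.bytes.each_slice(width).with_index.map do |chunk, i|
    offset = format('%08X', i * width)
    # Pad the hex column so the character column lines up on short rows.
    hex    = chunk.map { |b| format('%02X', b) }.join(' ').ljust(width * 3 - 1)
    # Show a period for anything outside printable ASCII.
    text   = chunk.map { |b| b.between?(0x20, 0x7E) ? b.chr : '.' }.join
    "#{offset} #{hex} #{text}"
  end.join("\n")
end

puts hex_dump("hello\r\n")
```

To dump a real file, its bytes can be read with `File.binread(path)` and passed to the same helper.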

Rich Text Format (RTF) File - hello.rtf (168 bytes long)

You can see that an RTF file includes lots of extra guff. Generally, all the letters of the word will be together. However, if you have two or more words, other codes can appear between the words, making them difficult to locate.

00000000 7B 5C 72 74 66 31 5C 61 6E 73 69 5C 61 6E 73 69 {\rtf1\ansi\ansi 
00000010 63 70 67 31 32 35 32 5C 64 65 66 66 30 5C 64 65 cpg1252\deff0\de 
00000020 66 6C 61 6E 67 33 30 38 31 7B 5C 66 6F 6E 74 74 flang3081{\fontt 
00000030 62 6C 7B 5C 66 30 5C 66 73 77 69 73 73 5C 66 63 bl{\f0\fswiss\fc 
00000040 68 61 72 73 65 74 30 20 41 72 69 61 6C 3B 7D 7D harset0 Arial;}} 
00000050 0D 0A 7B 5C 2A 5C 67 65 6E 65 72 61 74 6F 72 20 ..{\*\generator 
00000060 4D 73 66 74 65 64 69 74 20 35 2E 34 31 2E 31 35 Msftedit 5.41.15 
00000070 2E 31 35 30 33 3B 7D 5C 76 69 65 77 6B 69 6E 64 .1503;}\viewkind 
00000080 34 5C 75 63 31 5C 70 61 72 64 5C 66 30 5C 66 73 4\uc1\pard\f0\fs 
00000090 32 30 20 68 65 6C 6C 6F 5C 70 61 72 0D 0A 5C 70 20 hello\par..\p 
000000A0 61 72 0D 0A 7D 0D 0A 00                         ar..}... 

Microsoft Word Document - hello.doc (19,968 bytes long)

The file below, even without any formatting, is huge, so we've removed large sections of it for clarity. A major point we have to make here is that Word relies on the exact position of various aspects of the file being fixed, such as font tables, symbol tables and other internal references. If these positions are changed (e.g. by searching for 'hello' and replacing it with a shorter string such as 'bye' or a longer string such as 'hello there') then the document will be corrupted and MS Word will not be able to load the document again. Recovery may not be possible. This is why you CANNOT use a text editor or text tool on Word documents. You must use a specific tool that knows how to maintain the correct offsets, such as WordPipe for MS Word, ExcelPipe for MS Excel or PowerPointPipe for MS PowerPoint.

An additional point to note is that the word 'Symbol' is stored in the Word document in Unicode format (see below), so a text editor or text tool will not find it. Since this file contains mixed sections of ASCII and Unicode, it is crucial that the file positions are left unchanged.

00000000 D0 CF 11 E0 A1 B1 1A E1 00 00 00 00 00 00 00 00 ÐÏ.à¡±.á........ 
00000010 00 00 00 00 00 00 00 00 3E 00 03 00 FE FF 09 00 ........>...þÿ.. 
00000020 06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 ................ 
...
000009F0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
00000A00 68 65 6C 6C 6F 0D 0D 00 00 00 00 00 00 00 00 00 hello........... 
00000A10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
...
00001A40 00 53 00 79 00 6D 00 62 00 6F 00 6C 00 00 00 33 .S.y.m.b.o.l...3 
...
00004DF0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 

Unicode Plain Text File - hello.txt (16 bytes long)

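ASCII is being replaced in many applications by Unicode. The encoding used in this file, UTF-16, stores most characters in 16 bits (2 bytes), which lets it represent non-Roman scripts such as Japanese, Chinese, and Cyrillic. The file starts with the byte-order mark FF FE, which identifies it as little-endian UTF-16. A text editor or text tool that searches only for ASCII bytes won't find 'hello' in this file, because a 00 byte follows each letter. TextPipe Pro provides Unicode search and replace facilities, in addition to ASCII search and replace, so it can find both forms of 'hello'.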

00000000 FF FE 68 00 65 00 6C 00 6C 00 6F 00 0D 00 0A 00 ÿþh.e.l.l.o..... 
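The effect can be reproduced in Ruby. The sketch below rebuilds the 16 bytes shown in the dump (assuming the file is little-endian UTF-16 with a byte-order mark, as the FF FE prefix indicates) and compares a plain ASCII byte search with a search for the UTF-16 form. The variable names are illustrative.

```ruby
# Rebuild the 16 bytes of the Unicode hello.txt: byte-order mark + UTF-16LE text.
bom  = [0xFF, 0xFE].pack('C*')
data = bom + "hello\r\n".encode('UTF-16LE').b

ascii_needle = 'hello'.b                      # 68 65 6C 6C 6F
utf16_needle = 'hello'.encode('UTF-16LE').b   # 68 00 65 00 6C 00 6C 00 6F 00

puts data.bytesize                 # 16
puts data.include?(ascii_needle)   # false: a 00 byte follows each letter
puts data.index(utf16_needle)      # 2: the match starts right after the BOM

# Decoding the bytes after the BOM recovers the ordinary text.
text = data.byteslice(2, data.bytesize - 2)
           .force_encoding('UTF-16LE')
           .encode('UTF-8')
puts text.inspect                  # "hello\r\n"
```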

Convert binary files to text files

Now, to convert a binary file to a useful text form, you need to strip out all the binary characters - the formatting, control and other gobbledygook stuff.  TextPipe Pro provides a simple filter for this under Filters\Remove\Binary characters.
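As a rough illustration of what stripping binary characters means, here is a Ruby sketch. It is not how TextPipe Pro implements its filter; `strip_binary` is a hypothetical helper, and the assumption that "text" means printable ASCII plus tab, line feed and carriage return is made only for this example.

```ruby
# Bytes kept in addition to printable ASCII: tab, line feed, carriage return.
KEPT_CONTROL_BYTES = [0x09, 0x0A, 0x0D].freeze

# Drop every byte that is neither printable ASCII nor one of the kept control bytes.
# strip_binary is a hypothetical helper for illustration only.
def strip_binary(data)
  kept = data.bytes.select do |b|
    b.between?(0x20, 0x7E) || KEPT_CONTROL_BYTES.include?(b)
  end
  kept.pack('C*').force_encoding('US-ASCII')
end

# Row 00000A00 of the Word dump above: 'hello', two carriage returns, then zero bytes.
row = ("hello\r\r" + "\x00" * 9).b
puts strip_binary(row).inspect   # "hello\r\r"
```

A filter this crude also keeps any byte inside binary structures that happens to fall in the printable range, so the output of a real document will usually contain stray characters around the recovered text.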


You can also generate your own custom filter that only removes the binary characters you specify by using Filters\Maps\New map.

 

You may freely link to this page, but you may not copy its content.

 

 

-------------------------------------------------------------------------------------------------------------------------------

 

from http://book.77169.org/ask2/ask112678.htm

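Viewed in terms of how their contents are encoded, files fall into two kinds: ASCII files and binary files.

  An ASCII file, also called a text file, is stored on disk with one byte per character, each byte holding that character's ASCII code. For example, the number 5678 is stored as:
ASCII code:  00110101 00110110 00110111 00111000
                ↓        ↓        ↓        ↓
Digit:          5        6        7        8
This takes 4 bytes in total. An ASCII file can be displayed on screen character by character. A source program file, for example, is an ASCII file, and the DOS command TYPE displays its contents. Because the file is displayed as characters, its contents can be read directly.

  A binary file stores its data in binary encoding. For example, the number 5678 is stored as 00010110 00101110, which takes only two bytes. A binary file can also be displayed on screen, but its contents are not human-readable.

Likewise, 1949 is stored as 079D (binary 0000 0111 1001 1101, the binary equivalent of decimal 1949).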
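The 5678 and 1949 examples above can be checked in Ruby. This sketch assumes the binary form is a 16-bit unsigned integer written high byte first (the `'n'` directive of `Array#pack`), which matches the bit patterns quoted above. A program that writes the integer in a little-endian layout would produce the same two bytes in the opposite order. The `bits` helper is hypothetical.

```ruby
# Render a list of byte values as space-separated 8-bit groups.
def bits(bytes)
  bytes.map { |b| format('%08b', b) }.join(' ')
end

text_form   = '5678'            # four ASCII characters, one byte each
binary_form = [5678].pack('n')  # one 16-bit unsigned integer, high byte first

puts text_form.bytesize         # 4
puts bits(text_form.bytes)      # 00110101 00110110 00110111 00111000
puts binary_form.bytesize       # 2
puts bits(binary_form.bytes)    # 00010110 00101110

# 1949 packed the same way, shown as hexadecimal.
puts [1949].pack('n').unpack1('H*').upcase   # 079D
```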

 
