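I have recently been writing a FileUtil on top of New IO (NIO), and I ran into an encoding problem.
For example:
I call writeFile("D:\\test.txt","中国",null)
and then readFile("D:\\test.txt") returns garbled text.
I then tried decoding with a Charset, taking the target file's encoding from System.getProperty("file.encoding"), but that still did not work.
I suspect I need to find out the encoding of the target file's byte stream, so that I can read the file with the matching encoding.
How can I determine the encoding of the target file's byte stream?
Or perhaps there is a better approach. Any advice is welcome. Thanks!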
<pre name='code' class='java'>// Charset used when reading files.
private static Charset charset = Charset.forName(System.getProperty("file.encoding"));

/**
 * Reads a file.
 * @param fileName the file name
 * @return the content that was read, as a string
 */
public String readFile(String fileName){
    CharBuffer cb = null;
    try {
        FileChannel in = new FileInputStream(fileName).getChannel();
        int size = (int)in.size();
        MappedByteBuffer mppedByteBuffer = in.map(FileChannel.MapMode.READ_ONLY, 0, size);
        cb = charset.newDecoder().decode(mppedByteBuffer);
        in.close();
    } catch(FileNotFoundException e) {
        e.printStackTrace();
    } catch(IOException e){
        e.printStackTrace();
    }
    return cb.toString();
}

/**
 * Creates (writes) a file.
 * @param fileName the file name
 * @param content the content
 * @param encoding the encoding to write with; defaults to utf-8
 * @return true if the file was created, false otherwise
 */
public boolean writeFile(String fileName,String content,String encoding){
    try{
        FileChannel out = new FileOutputStream(fileName).getChannel();
        encoding = encoding == null ? "" : encoding;
        if(encoding.length() <= 0) encoding = "utf-8";
        out.write(ByteBuffer.wrap(content.getBytes(encoding)));
        out.close();
        return true;
    }catch(FileNotFoundException e){
        e.printStackTrace();
    }catch(IOException e){
        e.printStackTrace();
    }
    return false;
}
</pre>
Comments
#7
zjit
2008-06-09
Many thanks, tianzhihua.
#6
tianzhihua
2008-06-09
A note: I found the code above online and modified it a little; I no longer remember where it came from.
#5
tianzhihua
2008-06-09
<pre name='code' class='java'>package org.simpleframework.util;

import junit.framework.TestCase;

public class StringEncodingTest extends TestCase{
    // Note: getBytes() encodes with the platform default charset,
    // so the encoding printed here depends on the machine running the test.
    public void testGetEncoding() throws Exception{
        System.out.println(new StringEncoding().getEncoding("费多少发送到费多少".getBytes()).getEncoding());
        System.out.println(new StringEncoding().getEncoding("費多少 費多少".getBytes()).getEncoding());
        System.out.println(new StringEncoding().getEncoding("あなたの訳すセンテンスを入力して下さい".getBytes()).getEncoding());
    }
}
</pre>
<pre name='code' class='java'>package org.simpleframework.util;
public class StringEncoding{
public static final class Encoding{
private String name;
private String encoding;
public Encoding(String name,String encoding){
this.name = name;
this.encoding = encoding;
}
public String getName() {
return name;
}
public String getEncoding() {
return encoding;
}
}
private final static int GB2312 = 0;
private final static int GBK = 1;
private final static int BIG5 = 2;
private final static int UTF8 = 3;
private final static int UNICODE = 4;
private final static int EUC_KR = 5;
private final static int SJIS = 6;
private final static int EUC_JP = 7;
private final static int ASCII = 8;
private final static int UNKNOWN = 9;
private final static int TOTALT = 10;
private static Encoding[] encodings;
private int[][] GB2312format;
private int[][] GBKformat;
private int[][] Big5format;
private int[][] EUC_KRformat;
private int[][] JPformat;
static{
initEncodings();
}
private static void initEncodings() {
encodings = new Encoding[TOTALT];
int i = 0;
encodings[i++] = new Encoding("GB2312","GB2312");
encodings[i++] = new Encoding("GBK","GBK");
encodings[i++] = new Encoding("BIG5","BIG5");
encodings[i++] = new Encoding("UTF8","UTF-8");
encodings[i++] = new Encoding("UNICODE(UTF-16)","UTF-16");
encodings[i++] = new Encoding("EUC-KR","EUC-KR");
encodings[i++] = new Encoding("Shift-JIS","Shift_JIS");
encodings[i++] = new Encoding("EUC-JP","EUC-JP");
encodings[i++] = new Encoding("ASCII","ASCII");
encodings[i++] = new Encoding("ISO8859-1","ISO8859-1");
}
public StringEncoding(){
init();
}
private void init() {
GB2312format = new int[94][94];
GBKformat = new int[126][191];
Big5format = new int[94][158];
EUC_KRformat = new int[94][94];
JPformat = new int[94][94];
}
public Encoding getEncoding(final byte[] data){
return check(getEncodingValue(data));
}
private static Encoding check(final int result){
if (result == -1){
return encodings[UNKNOWN];
}
return encodings[result];
}
private int getEncodingValue(byte[] content){
if (content == null)
return -1;
int[] scores;
int index, maxscore = 0;
int encoding = UNKNOWN;
scores = new int[TOTALT];
// Score each candidate encoding
scores[GB2312] = getProbabilityByGB2312Encoding(content);
scores[GBK] = getProbabilityByGBKEncoding(content);
scores[BIG5] = getProbabilityByBIG5Encoding(content);
scores[UTF8] = getProbabilityByUTF8Encoding(content);
scores[UNICODE] = getProbabilityByUTF16Encoding(content);
scores[EUC_KR] = getProbabilityByEUC_KREncoding(content);
scores[ASCII] = getProbabilityByASCIIEncoding(content);
scores[SJIS] = getProbabilityBySJISEncoding(content);
scores[EUC_JP] = getProbabilityByEUC_JPEncoding(content);
scores[UNKNOWN] = 0;
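// Pick the encoding with the highest score
for (index = 0; index < TOTALT; index++){
if (scores[index] > maxscore){
// Index of the best candidate so far
encoding = index;
// Best score so far
maxscore = scores[index];
}
}
// Only accept a result whose score is above 50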
if (maxscore <= 50){
encoding = UNKNOWN;
}
return encoding;
}
/**
* Scores the content as GB2312.
*
* @param content the raw bytes to examine
* @return a heuristic score; higher means more likely
*/
private int getProbabilityByGB2312Encoding(byte[] content){
int i, rawtextlen = 0;
int dbchars = 1, gbchars = 1;
long gbformat = 0, totalformat = 1;
float rangeval = 0, formatval = 0;
int row, column;
// Check whether the bytes fall in the double-byte Chinese character range
rawtextlen = content.length;
for (i = 0; i < rawtextlen - 1; i++){
if (content[i] >= 0){
} else{
dbchars++;
// A GB2312 character is two bytes, each in the range 0xA1 to 0xFE
if ((byte) 0xA1 <= content[i] && content[i] <= (byte) 0xF7
&& (byte) 0xA1 <= content[i + 1]
&& content[i + 1] <= (byte) 0xFE){
gbchars++;
totalformat += 500;
row = content[i] + 256 - 0xA1;
column = content[i + 1] + 256 - 0xA1;
if (GB2312format[row][column] != 0){
gbformat += GB2312format[row][column];
} else if (15 <= row && row < 55){
// Within the GB2312 Chinese character rows
gbformat += 200;
}
}
i++;
}
}
rangeval = 50 * ((float) gbchars / (float) dbchars);
formatval = 50 * ((float) gbformat / (float) totalformat);
return (int) (rangeval + formatval);
}
/**
* Scores the content as GBK.
*
* @param content the raw bytes to examine
* @return a heuristic score; higher means more likely
*/
private int getProbabilityByGBKEncoding(byte[] content){
int i, rawtextlen = 0;
int dbchars = 1, gbchars = 1;
long gbformat = 0, totalformat = 1;
float rangeval = 0, formatval = 0;
int row, column;
rawtextlen = content.length;
for (i = 0; i < rawtextlen - 1; i++){
if (content[i] >= 0){
} else{
dbchars++;
if ((byte) 0xA1 <= content[i] && content[i] <= (byte) 0xF7
&& // GB2312 range
(byte) 0xA1 <= content[i + 1]
&& content[i + 1] <= (byte) 0xFE){
gbchars++;
totalformat += 500;
row = content[i] + 256 - 0xA1;
column = content[i + 1] + 256 - 0xA1;
if (GB2312format[row][column] != 0){
gbformat += GB2312format[row][column];
} else if (15 <= row && row < 55){
gbformat += 200;
}
} else if ((byte) 0x81 <= content[i]
&& content[i] <= (byte) 0xFE && // GBK extension area
(((byte) 0x80 <= content[i + 1] && content[i + 1] <= (byte) 0xFE) || ((byte) 0x40 <= content[i + 1] && content[i + 1] <= (byte) 0x7E))){
gbchars++;
totalformat += 500;
row = content[i] + 256 - 0x81;
if (0x40 <= content[i + 1] && content[i + 1] <= 0x7E){
column = content[i + 1] - 0x40;
} else{
column = content[i + 1] + 256 - 0x40;
}
if (GBKformat[row][column] != 0){
gbformat += GBKformat[row][column];
}
}
i++;
}
}
rangeval = 50 * ((float) gbchars / (float) dbchars);
formatval = 50 * ((float) gbformat / (float) totalformat);
return (int) (rangeval + formatval) - 1;
}
/**
* Scores the content as Big5.
*
* @param content the raw bytes to examine
* @return a heuristic score; higher means more likely
*/
private int getProbabilityByBIG5Encoding(byte[] content){
int i, rawtextlen = 0;
int dbchars = 1, bfchars = 1;
float rangeval = 0, formatval = 0;
long bfformat = 0, totalformat = 1;
int row, column;
rawtextlen = content.length;
for (i = 0; i < rawtextlen - 1; i++){
if (content[i] >= 0){
} else{
dbchars++;
if ((byte) 0xA1 <= content[i]
&& content[i] <= (byte) 0xF9
&& (((byte) 0x40 <= content[i + 1] && content[i + 1] <= (byte) 0x7E) || ((byte) 0xA1 <= content[i + 1] && content[i + 1] <= (byte) 0xFE))){
bfchars++;
totalformat += 500;
row = content[i] + 256 - 0xA1;
if (0x40 <= content[i + 1] && content[i + 1] <= 0x7E){
column = content[i + 1] - 0x40;
} else{
column = content[i + 1] + 256 - 0x61;
}
if (Big5format[row][column] != 0){
bfformat += Big5format[row][column];
} else if (3 <= row && row <= 37){
bfformat += 200;
}
}
i++;
}
}
rangeval = 50 * ((float) bfchars / (float) dbchars);
formatval = 50 * ((float) bfformat / (float) totalformat);
return (int) (rangeval + formatval);
}
/**
* Scores the content as UTF-8.
*
* @param content the raw bytes to examine
* @return a heuristic score; higher means more likely
*/
private int getProbabilityByUTF8Encoding(byte[] content){
int score = 0;
int i, rawtextlen = 0;
int goodbytes = 0, asciibytes = 0;
// Count the bytes that form well-formed two-byte and three-byte UTF-8 sequences
rawtextlen = content.length;
for (i = 0; i < rawtextlen; i++){
if ((content[i] & (byte) 0x7F) == content[i]){
asciibytes++;
} else if (-64 <= content[i] && content[i] <= -33
&& i + 1 < rawtextlen && -128 <= content[i + 1]
&& content[i + 1] <= -65){
goodbytes += 2;
i++;
} else if (-32 <= content[i] && content[i] <= -17
&& i + 2 < rawtextlen && -128 <= content[i + 1]
&& content[i + 1] <= -65 && -128 <= content[i + 2]
&& content[i + 2] <= -65){
goodbytes += 3;
i += 2;
}
}
if (asciibytes == rawtextlen){
return 0;
}
score = (int) (100 * ((float) goodbytes / (float) (rawtextlen - asciibytes)));
// Scores of 98 or below drop to zero, unless above 95 with more than 30 good bytes
if (score > 98){
return score;
} else if (score > 95 && goodbytes > 30){
return score;
} else{
return 0;
}
}
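/**
* Scores the content as UTF-16 by looking for a byte-order mark.
*
* @param content the raw bytes to examine
* @return 100 if a UTF-16 byte-order mark is present, otherwise 0
*/
private int getProbabilityByUTF16Encoding(byte[] content){
// The length check must guard both byte-order-mark comparisons
if (content.length > 1
&& (((byte) 0xFE == content[0] && (byte) 0xFF == content[1])
|| ((byte) 0xFF == content[0] && (byte) 0xFE == content[1]))){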
return 100;
}
return 0;
}
/**
* Scores the content as ASCII.
*
* @param content the raw bytes to examine
* @return a heuristic score; higher means more likely
*/
private static int getProbabilityByASCIIEncoding(byte[] content){
int score = 75;
int i, rawtextlen;
rawtextlen = content.length;
for (i = 0; i < rawtextlen; i++){
if (content[i] < 0){
score = score - 5;
} else if (content[i] == (byte) 0x1B){ // ESC (used by ISO 2022)
score = score - 5;
}
if (score <= 0){
return 0;
}
}
return score;
}
/**
* Scores the content as EUC-KR.
*
* @param content the raw bytes to examine
* @return a heuristic score; higher means more likely
*/
private int getProbabilityByEUC_KREncoding(byte[] content){
int i, rawtextlen = 0;
int dbchars = 1, krchars = 1;
long krformat = 0, totalformat = 1;
float rangeval = 0, formatval = 0;
int row, column;
rawtextlen = content.length;
for (i = 0; i < rawtextlen - 1; i++){
if (content[i] >= 0){
} else{
dbchars++;
if ((byte) 0xA1 <= content[i] && content[i] <= (byte) 0xFE
&& (byte) 0xA1 <= content[i + 1]
&& content[i + 1] <= (byte) 0xFE){
krchars++;
totalformat += 500;
row = content[i] + 256 - 0xA1;
column = content[i + 1] + 256 - 0xA1;
if (EUC_KRformat[row][column] != 0){
krformat += EUC_KRformat[row][column];
} else if (15 <= row && row < 55){
krformat += 0;
}
}
i++;
}
}
rangeval = 50 * ((float) krchars / (float) dbchars);
formatval = 50 * ((float) krformat / (float) totalformat);
return (int) (rangeval + formatval);
}
private int getProbabilityByEUC_JPEncoding(byte[] content){
int i, rawtextlen = 0;
int dbchars = 1, jpchars = 1;
long jpformat = 0, totalformat = 1;
float rangeval = 0, formatval = 0;
int row, column;
rawtextlen = content.length;
for (i = 0; i < rawtextlen - 1; i++){
if (content[i] >= 0){
} else{
dbchars++;
if ((byte) 0xA1 <= content[i] && content[i] <= (byte) 0xFE
&& (byte) 0xA1 <= content[i + 1]
&& content[i + 1] <= (byte) 0xFE){
jpchars++;
totalformat += 500;
row = content[i] + 256 - 0xA1;
column = content[i + 1] + 256 - 0xA1;
if (JPformat[row][column] != 0){
jpformat += JPformat[row][column];
} else if (15 <= row && row < 55){
jpformat += 0;
}
}
i++;
}
}
rangeval = 50 * ((float) jpchars / (float) dbchars);
formatval = 50 * ((float) jpformat / (float) totalformat);
return (int) (rangeval + formatval);
}
private int getProbabilityBySJISEncoding(byte[] content){
int i, rawtextlen = 0;
int dbchars = 1, jpchars = 1;
long jpformat = 0, totalformat = 1;
float rangeval = 0, formatval = 0;
int row, column, adjust;
rawtextlen = content.length;
for (i = 0; i < rawtextlen - 1; i++){
if (content[i] >= 0){
} else{
dbchars++;
if (i + 1 < content.length
&& (((byte) 0x81 <= content[i] && content[i] <= (byte) 0x9F) || ((byte) 0xE0 <= content[i] && content[i] <= (byte) 0xEF))
&& (((byte) 0x40 <= content[i + 1] && content[i + 1] <= (byte) 0x7E) || ((byte) 0x80 <= content[i + 1] && content[i + 1] <= (byte) 0xFC))){
jpchars++;
totalformat += 500;
row = content[i] + 256;
column = content[i + 1] + 256;
if (column < 0x9f){
adjust = 1;
if (column > 0x7f){
column -= 0x20;
} else{
column -= 0x19;
}
} else{
adjust = 0;
column -= 0x7e;
}
if (row < 0xa0){
row = ((row - 0x70) << 1) - adjust;
} else{
row = ((row - 0xb0) << 1) - adjust;
}
row -= 0x20;
column = 0x20;
if (row < JPformat.length && column < JPformat[row].length
&& JPformat[row][column] != 0){
jpformat += JPformat[row][column];
}
i++;
} else if ((byte) 0xA1 <= content[i]
&& content[i] <= (byte) 0xDF){
}
}
}
rangeval = 50 * ((float) jpchars / (float) dbchars);
formatval = 50 * ((float) jpformat / (float) totalformat);
return (int) (rangeval + formatval) - 1;
}
}
</pre>
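When the only realistic candidates are UTF-8, UTF-16 and one legacy encoding, a much smaller check than the scoring class above may be enough. The following is only a sketch under that assumption, not a replacement for the class: it looks for a byte-order mark first, then tests whether the bytes are well-formed UTF-8 with a strict decoder, and otherwise returns a fallback chosen by the caller. The class name `SimpleCharsetGuess` and its methods are hypothetical; only standard `java.nio.charset` calls are used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the code posted above.
final class SimpleCharsetGuess {

    /** Returns true when data begins with the given byte-order mark. */
    private static boolean startsWith(byte[] data, int... bom) {
        if (data.length < bom.length) {
            return false;
        }
        for (int i = 0; i < bom.length; i++) {
            if ((data[i] & 0xFF) != bom[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns true when data decodes under cs without malformed or unmappable input. */
    static boolean decodesCleanly(byte[] data, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Guesses a charset: byte-order mark first, then strict UTF-8, then the fallback.
     * Pure ASCII and empty input are reported as UTF-8, since ASCII is a subset of it.
     */
    static Charset guess(byte[] data, Charset fallback) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;   // UTF-8 BOM (the decoder does not strip it)
        }
        if (startsWith(data, 0xFE, 0xFF) || startsWith(data, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16;  // this decoder reads the BOM to pick the byte order
        }
        if (decodesCleanly(data, StandardCharsets.UTF_8)) {
            return StandardCharsets.UTF_8;
        }
        return fallback;                     // the caller's assumption, e.g. GBK on a Chinese system
    }
}
```

This is still a guess. Many byte sequences are valid in more than one encoding, so the fallback branch cannot tell GBK from Big5 or Shift_JIS; that is the part the frequency-based scoring above tries to address.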
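#4
myy
2008-06-06
zjit wrote:
Is there any way to determine the encoding of the target file, that is, the encoding of the file's byte stream? Then I could read the file with the same encoding.
Strictly speaking, there is no fully reliable way to determine it (if there is no other marker to go by). Some software is smarter, of course, and can "guess" the encoding by analyzing part of the file's content, UltraEdit for example.
That is why XML, HTML, HTTP headers and so on all carry an explicit encoding or charset declaration.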
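The idea of an explicit declaration can be applied to the FileUtil case as well. The sketch below is only an illustration with a made-up format: the writer puts a first line such as `charset=UTF-8` in plain ASCII in front of the content, and the reader parses that line before decoding the rest. The class name `DeclaredCharsetFormat` and the `charset=` header are assumptions for this example, not an existing convention.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical self-describing format: "charset=<name>\n" followed by the encoded content.
final class DeclaredCharsetFormat {

    private static final String PREFIX = "charset=";

    /** Encodes content in cs and prepends an ASCII header line naming the charset. */
    static byte[] encode(String content, Charset cs) {
        byte[] header = (PREFIX + cs.name() + "\n").getBytes(StandardCharsets.US_ASCII);
        byte[] body = content.getBytes(cs);
        byte[] out = Arrays.copyOf(header, header.length + body.length);
        System.arraycopy(body, 0, out, header.length, body.length);
        return out;
    }

    /** Reads the header line, then decodes the remaining bytes with the declared charset. */
    static String decode(byte[] data) {
        int nl = 0;
        while (nl < data.length && data[nl] != '\n') {
            nl++;
        }
        if (nl == data.length) {
            throw new IllegalArgumentException("missing charset header line");
        }
        String header = new String(data, 0, nl, StandardCharsets.US_ASCII);
        if (!header.startsWith(PREFIX)) {
            throw new IllegalArgumentException("unexpected header: " + header);
        }
        // Charset.forName throws an unchecked exception for unknown or illegal names.
        Charset cs = Charset.forName(header.substring(PREFIX.length()));
        return new String(data, nl + 1, data.length - nl - 1, cs);
    }
}
```

The trade-off is that the file is no longer plain text that other tools can open unchanged, which is why this only suits files that the same program both writes and reads.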
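#3
zjit
2008-06-06
That would indeed avoid the garbled text. But the requirement is that writeFile can write a file in a caller-specified encoding, so I cannot simply use the default. Is there any way to determine the encoding of the target file, that is, the encoding of the file's byte stream? Then I could read the file with the same encoding.
#2
wennew
2008-06-06
Alternatively, use the system default encoding everywhere. In writeFile, call out.write(ByteBuffer.wrap(content.getBytes())); calling getBytes() without an encoding argument uses the system default encoding.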
#1
wennew
2008-06-06
Your writeFile further down uses utf-8, so declaring private static Charset charset = Charset.forName("UTF-8"); at the top is enough.
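A minimal sketch of this suggestion, assuming Java 8 or later and a hypothetical class name `ExplicitCharsetFileUtil`: both methods take the charset as a parameter and fall back to the same UTF-8 default, so a write followed by a read with the same argument always agrees. It is not the original poster's FileUtil, only an illustration of keeping the two sides in step.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical utility for illustration only.
final class ExplicitCharsetFileUtil {

    /** Writes content in the given charset; null means UTF-8. */
    static void writeFile(Path file, String content, Charset charset) {
        Charset cs = (charset == null) ? StandardCharsets.UTF_8 : charset;
        try (FileChannel out = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = cs.encode(content);
            while (buf.hasRemaining()) {
                out.write(buf);  // a single write call may not drain the buffer
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole file and decodes it in the given charset; null means UTF-8. */
    static String readFile(Path file, Charset charset) {
        Charset cs = (charset == null) ? StandardCharsets.UTF_8 : charset;
        try {
            return new String(Files.readAllBytes(file), cs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes content to a temporary file and reads it back with the same charset. */
    static String roundTrip(String content, Charset charset) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt");
            try {
                writeFile(tmp, content, charset);
                return readFile(tmp, charset);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, `roundTrip("中国", null)` writes and reads in UTF-8. The garbling in the original post comes from writing in utf-8 while reading with the platform value of `file.encoding`.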