
C++: How to read and write files in UTF-8 encoding with fopen and CFile


 

How to write UTF-8 file with fprintf in C++

http://stackoverflow.com/questions/10028750/how-to-write-utf-8-file-with-fprintf-in-c

 

You shouldn't need to set your locale or set any special modes on the file if you just want to use fprintf. You simply have to use UTF-8 encoded strings.

#include <cstdio>
#include <codecvt>
#include <locale>
#include <string>

int main() {
    // Convert the wide literal (UTF-16 on Windows) into a UTF-8 byte string.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>,wchar_t> convert;
    std::string utf8_string = convert.to_bytes(L"кошка 日本国");

    if(FILE *f = fopen("tmp","w")) {
        fprintf(f,"%s\n",utf8_string.c_str());
        fclose(f);
    }
}

 

Save the program as UTF-8 with signature or UTF-16 (i.e. don't use UTF-8 without signature, otherwise VS won't produce the right string literal). The file written by the program will contain the UTF-8 version of that string. Or you can do:

#include <cstdio>

int main() {
    if(FILE *f = fopen("tmp","w")) {
        // The literal's bytes are passed through unchanged.
        fprintf(f,"%s\n","кошка 日本国");
        fclose(f);
    }
}

 

In this case you must save the file as UTF-8 without signature, because you want the compiler to think the source encoding is the same as the execution encoding... This is a bit of a hack that relies on the compiler's, IMO, broken behavior.
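If depending on the source-file encoding feels too fragile, one option is to spell the UTF-8 bytes out with hex escapes, so the literal is the same whatever encoding the file is saved in. The following is a minimal sketch, not part of the original answer; the helper name writeUtf8Sample and its path parameter are hypothetical.

```cpp
#include <cstdio>

// Hypothetical helper: writes "кошка" plus a newline as raw UTF-8 bytes.
// Each of these Cyrillic letters takes two bytes in UTF-8; the hex escapes
// keep the source file pure ASCII, so its saved encoding no longer matters.
bool writeUtf8Sample(const char *path) {
    static const char text[] =
        "\xD0\xBA\xD0\xBE\xD1\x88\xD0\xBA\xD0\xB0\n"; // к о ш к а
    FILE *f = std::fopen(path, "wb"); // binary mode: no newline translation
    if (!f) return false;
    bool ok = std::fputs(text, f) >= 0;
    return (std::fclose(f) == 0) && ok;
}
```

Calling writeUtf8Sample("tmp") should leave an 11-byte file: ten bytes for the five letters and one for the newline.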

You can do basically the same thing with any of the other APIs for writing narrow characters to a file, but note that none of these methods work for writing UTF-8 to the Windows console. Because the C runtime and/or the console is a bit broken, you can only write UTF-8 directly to the console by calling SetConsoleOutputCP(65001) and then using one of the puts family of functions.
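The console workaround just described can be sketched roughly as follows. The helper name printUtf8Line is hypothetical, and the #ifdef guard is an assumption added so the same file also compiles on non-Windows systems, where no code-page switch is needed.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: prints one line of UTF-8 text to standard output.
// On Windows it first switches the console output code page to UTF-8.
bool printUtf8Line(const char *utf8) {
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8); // CP_UTF8 is 65001
#endif
    return std::puts(utf8) >= 0; // puts appends the newline itself
}
```

Whether the glyphs actually render still depends on the console font, which this sketch does not address.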

If you want to use wide characters instead of narrow characters then locale based methods and setting modes on file descriptors could come into play.

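#include <cstdio>
#include <fcntl.h>
#include <io.h>

int main() {
    if(FILE *f = fopen("tmp","w")) {
        // Switch the stream to UTF-8 translation for wide-character output.
        _setmode(_fileno(f), _O_U8TEXT);
        // %ls always means a wide string; plain %s does so only in
        // Microsoft's wide printf functions.
        fwprintf(f,L"%ls\n",L"кошка 日本国");
        fclose(f);
    }
}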

 

#include <fstream>
#include <codecvt>
#include <locale>

int main() {
    if(auto f = std::wofstream("tmp")) {
        // The facet converts the stream's wide characters to UTF-8 on output.
        f.imbue(std::locale(std::locale(),
                new std::codecvt_utf8_utf16<wchar_t>)); // assumes wchar_t is UTF-16
        f << L"кошка 日本国\n";
    }
}

 

The first example uses wstring_convert from C++11, but any other method of obtaining a UTF-8 encoding works too, e.g. WideCharToMultiByte. The last example uses a C++11 codecvt facet for which there's not a built-in, pre-c++11 replacement. The other two examples don't use C++11. 
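As an illustration of "any other method of obtaining a UTF-8 encoding", the conversion can also be written by hand in portable C++. This is a minimal sketch rather than anything from the original answer; the function name encodeUtf8 is hypothetical, and it takes a single Unicode code point instead of a UTF-16 string.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out; // surrogate, not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The result can be passed to fprintf with %s exactly like utf8_string in the first example. Input that arrives as UTF-16 would need its surrogate pairs combined into code points first, which this sketch leaves out.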

 

How to Read/Write UTF8 text files in C?

http://stackoverflow.com/questions/21737906/how-to-read-write-utf8-text-files-in-c

Instead of

fprintf(fout,"%c ",character);

 

use

fprintf(fout,"%c",character);

 

The second fprintf() omits the space after %c; that extra space is what caused out.txt to display weird characters. The reason is that fgetc() retrieves a single byte (the same thing as an ASCII character), not a whole UTF-8 character, so writing a space after every byte splits each multi-byte sequence apart. Since UTF-8 is ASCII-compatible, English characters are still written to the file just fine.

putchar(character) outputs the bytes sequentially, without the extra space between every byte, so the original UTF-8 sequence remains intact. To see what I'm talking about, try

while((character=fgetc(fin))!=EOF){
    putchar(character);
    printf(" "); // This mimics what you are doing when you write to out.txt
    fprintf(fout,"%c ",character);
}

 

If you want to write UTF-8 characters with the space between them to out.txt, you would need to handle the variable length encoding of a UTF-8 character.

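#include <stdio.h>

/* The first byte of a UTF-8 character indicates how many bytes
 * are in the character, so only that byte needs checking.
 * Assumes val is a valid lead byte, not a continuation byte.
 */
int numberOfBytesInChar(unsigned char val) {
    if (val < 128) {
        return 1;
    } else if (val < 224) {
        return 2;
    } else if (val < 240) {
        return 3;
    } else {
        return 4;
    }
}

int main(void) {
    FILE *fin = fopen("in.txt", "r");
    FILE *fout = fopen("out.txt", "w");
    int character;
    if (fin == NULL || fout == NULL) {
        perror("fopen");
        return 1;
    }
    while ((character = fgetc(fin)) != EOF) {
        /* Compute the length once, from the lead byte only. Re-evaluating
         * it on continuation bytes would cut 3- and 4-byte characters short. */
        int remaining = numberOfBytesInChar((unsigned char)character) - 1;
        /* Copy every byte except the last one without a space. */
        while (remaining-- > 0 && character != EOF) {
            putchar(character);
            fputc(character, fout);
            character = fgetc(fin);
        }
        if (character == EOF) {
            break; /* input ended in the middle of a character */
        }
        /* Last byte of the character, followed by the separating space. */
        putchar(character);
        putchar(' ');
        fprintf(fout, "%c ", character);
    }
    fclose(fin);
    fclose(fout);
    printf("\nFile has been created...\n");
    return 0;
}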
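numberOfBytesInChar() trusts its input: a stray continuation byte (0x80 to 0xBF) is reported as the start of a 2-byte character. For input that may not be clean, a slightly stricter variant can flag such bytes instead. The sketch below is one possible approach in C++, with hypothetical names (utf8SequenceLength, splitUtf8); it checks lead bytes only and does not validate the continuation bytes that follow.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length of the UTF-8 sequence that starts with this lead byte,
// or 0 if the byte cannot start a sequence (e.g. a continuation byte).
int utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;                  // ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2; // 0xC0 and 0xC1 are never valid
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4; // above 0xF4 exceeds U+10FFFF
    return 0;
}

// Hypothetical helper: splits UTF-8 text into one string per character.
// Invalid lead bytes and truncated sequences are emitted as single bytes.
std::vector<std::string> splitUtf8(const std::string &text) {
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < text.size()) {
        int n = utf8SequenceLength(static_cast<unsigned char>(text[i]));
        if (n == 0 || i + n > text.size()) n = 1;
        chars.push_back(text.substr(i, n));
        i += n;
    }
    return chars;
}
```

Writing each element of the returned vector followed by a space gives the same spaced-out output as the program above, without reading the file one byte at a time.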

 

UTF-8, CString and CFile? (C++, MFC)

http://stackoverflow.com/questions/2318481/utf-8-cstring-and-cfile-c-mfc

When you output data you need to do (this assumes you are compiling in Unicode mode, which is highly recommended):

CString russianText = L"Привет мир";

CFile yourFile(_T("yourfile.txt"), CFile::modeWrite | CFile::modeCreate);

CT2CA outputString(russianText, CP_UTF8);
yourFile.Write(outputString, ::strlen(outputString));

If _UNICODE is not defined (you are working in multi-byte mode instead), you need to know what code page your input text is in and convert it to something you can use. This example shows working with Russian text that is in UTF-16 format, saving it to UTF-8:

// Example 1: convert from Russian text in UTF-16 (note the "L"
// in front of the string), into UTF-8.
CW2A russianTextAsUtf8(L"Привет мир", CP_UTF8);
yourFile.Write(russianTextAsUtf8, ::strlen(russianTextAsUtf8));

More likely, your Russian text is in some other code page, such as KOI8-R. In that case, you need to convert from the other code page into UTF-16. Then convert the UTF-16 into UTF-8. You cannot convert directly from KOI8-R to UTF-8 using the conversion macros because they always try to convert narrow text to the system code page. So the easy way is to do this:

// Example 2: convert from Russian text in KOI8-R (code page 20866)
// to UTF-16, and then to UTF-8. Conversions between UTFs are
// lossless.
CA2W russianTextAsUtf16("\xf0\xd2\xc9\xd7\xc5\xd4 \xcd\xc9\xd2", 20866);
CW2A russianTextAsUtf8(russianTextAsUtf16, CP_UTF8);
yourFile.Write(russianTextAsUtf8, ::strlen(russianTextAsUtf8));

You don't need a BOM (it's optional; I wouldn't use it unless there was a specific reason to do so).
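Because the BOM is optional, code that reads such a file back generally has to cope with both cases. The following is a minimal sketch in standard C++ rather than MFC; the helper name readUtf8File is hypothetical, and it returns the raw UTF-8 bytes without converting them to UTF-16.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: reads a whole file as bytes and removes a leading
// UTF-8 BOM (EF BB BF) if one is present. Returns an empty string if the
// file cannot be opened.
std::string readUtf8File(const std::string &path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    if (data.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        data.erase(0, 3); // drop the BOM so callers see only the text
    }
    return data;
}
```

In an MFC program the returned bytes could then go through the UTF-8 to LPCTSTR conversion macros, for example by passing data.c_str() together with CP_UTF8.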

Make sure you read this: http://msdn.microsoft.com/en-us/library/87zae4a3(VS.80).aspx. If you incorrectly use CT2CA (for example, using the assignment operator) you will run into trouble. The linked documentation page shows examples of how to use and how not to use it.

Further information:

  • The C in CT2CA indicates const. I use it when possible, but some conversions only support the non-const version (e.g. CW2A).
  • The T in CT2CA indicates that you are converting from an LPCTSTR. Thus it will work whether your code is compiled with the _UNICODE flag or not. You could also use CW2A (where W indicates wide characters).
  • The A in CT2CA indicates that you are converting to an "ANSI" (8-bit char) string.
  • Finally, the second parameter to CT2CA indicates the code page you are converting to.

To do the reverse conversion (from UTF-8 to LPCTSTR), you could do:

CString myString(CA2CT(russianText, CP_UTF8));

In this case, we are converting from an "ANSI" string in UTF-8 format to an LPCTSTR. The LPCTSTR is always assumed to be UTF-16 (if _UNICODE is defined) or the current system code page (if _UNICODE is not defined).

 

 
