Writing UTF-8 files in C++

Let’s say you need to write an XML file with this content:

How do we write that in C++?

At first glance, you might be tempted to write it like this:

When you open the file in Internet Explorer, for instance, surprise! It’s not rendered correctly:

So you could be tempted to say “let’s switch to wstring and wofstream”.

And when you run it and open the file again, nothing has changed. So where is the problem? The problem is that, with their default locale, neither ofstream nor wofstream writes the text as UTF-8. If you want the file to really be UTF-8, you have to encode the output buffer as UTF-8 yourself. To do that we can use WideCharToMultiByte(). This Windows API maps a wide character (UTF-16) string to a new character string, which is not necessarily from a multibyte character set. The first argument indicates the code page; for UTF-8 we need to specify CP_UTF8.

The following helper functions encode a std::wstring into a UTF-8 stream, wrapped into a std::string.
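As a rough sketch of what such a helper could look like (the name to_utf8 is hypothetical, and this is not necessarily the exact code from the original post): on Windows it calls WideCharToMultiByte() twice, first to measure the required buffer and then to convert. The non-Windows branch is an assumption added only so the sketch compiles elsewhere; it presumes wchar_t holds a full Unicode code point and does no validation.

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: converts a wide string to a std::string of UTF-8 bytes.
std::string to_utf8(const std::wstring& text)
{
    if (text.empty()) return std::string();

#ifdef _WIN32
    // First call: ask how many bytes the UTF-8 form needs.
    const int length = static_cast<int>(text.size());
    const int size = ::WideCharToMultiByte(CP_UTF8, 0, text.data(), length,
                                           nullptr, 0, nullptr, nullptr);
    if (size <= 0) return std::string();

    // Second call: convert into a buffer of exactly that size.
    // &utf8[0] yields a writable pointer, so no const_cast is needed.
    std::string utf8(static_cast<std::string::size_type>(size), '\0');
    ::WideCharToMultiByte(CP_UTF8, 0, text.data(), length,
                          &utf8[0], size, nullptr, nullptr);
    return utf8;
#else
    // Assumption: wchar_t is UTF-32 here (as on Linux with glibc).
    // Surrogates and out-of-range values are not checked.
    std::string utf8;
    for (wchar_t wc : text) {
        const unsigned long cp = static_cast<unsigned long>(wc);
        if (cp < 0x80) {
            utf8 += static_cast<char>(cp);
        } else if (cp < 0x800) {
            utf8 += static_cast<char>(0xC0 | (cp >> 6));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            utf8 += static_cast<char>(0xE0 | (cp >> 12));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            utf8 += static_cast<char>(0xF0 | (cp >> 18));
            utf8 += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return utf8;
#endif
}
```

The returned std::string is best treated as an opaque byte buffer: its size() counts bytes, not characters, so it is meant to be written out rather than processed character by character.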

With that in hand, all you have to do is make the following changes:
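A minimal sketch of that last step, assuming the text has already been converted into a std::string of UTF-8 bytes by a helper like the ones described above. The function name write_utf8_file is hypothetical, and opening the stream in binary mode is a choice made here (so the bytes reach the file untouched), not something the original post prescribes.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes already-encoded UTF-8 bytes to a file unchanged.
bool write_utf8_file(const std::string& path, const std::string& utf8)
{
    // A narrow ofstream is enough, because the data is already plain bytes.
    // Binary mode keeps the stream from translating '\n' on Windows.
    std::ofstream file(path.c_str(),
                       std::ios::out | std::ios::binary | std::ios::trunc);
    if (!file) return false;

    file.write(utf8.data(), static_cast<std::streamsize>(utf8.size()));
    return file.good();
}
```

No BOM is written here; if a consumer needs one, the three bytes 0xEF 0xBB 0xBF would have to be written first.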

And now when you open the file, you get what you wanted in the first place.

And that is all!


19 comments until now

  1. If I have a string (WideChar) that contains UTF-8 characters and I want to write it to a file, how do I do that?

  2. mariusbancila @ 2009-02-17 15:02

    If it’s already UTF-8, then write it just like you’d write any other string. Use ofstream. You can see the last example in my post. Or maybe I’m missing something and didn’t understand you well.

  3. Torkel Bj?rnson @ 2009-02-25 01:54

    You can also do this:

    wofstream file("test");
    file.imbue(locale("en_US.utf8")); // can throw
    file << L"this is a naïve example" << endl;

    Or if you know your default locale is UTF-8

  4. locale("en_US.utf8"); doesn’t work in Visual Studio. For "English_United States.1252" you must use locale("English"), but this does not set it to UTF-8.

    This method seems to work only on Linux.

    Mr. Marius’s example is the only working method for converting wchar_t to char UTF-8 on Windows/Visual Studio.

    And as far as the encoding goes… Notepad seems to automatically detect the encoding required to display UTF-8, as far as I tested.

  5. David Coorey @ 2009-03-18 01:30

    Thanks for this article, Marius. It was exactly what I needed to know.

    I sometimes wonder how I managed to do my job before the internet was around…

  6. I used this in order to create UTF-8 files:
    _wfopen(strFile, L"wt, ccs=UTF-8");

    Nothing else needed. 😀

  7. […] about UTF-8 Encoding. Then he gave a C++ code example of convert from/to UTF-16 to UTF-8. This is another example of writing UTF-8 in […]

  8. Sorin, if you’re using _wfopen() with ccs=UTF-8, a BOM is automatically added at the beginning of the file. The BOM can create more disadvantages than advantages.
    For instance, if you open the same file again with the same parameters (just to change its content), you must be careful to remove the old BOM, because otherwise you end up with two BOMs. 😀
    I ran into this situation in the past in a bug of one of my colleagues, and it isn’t too pretty.
    I use the std::ofstream class and a conversion class that contains a conversion method similar to Marius’s sample.

  9. Hi, I need help, please.
    I want to write a text like:

    “i am a student.i’m studing computer science at school.i love programming.i want to be very good in it.”

    in a text file (.dat or .txt) and then read it in C++
    in order to find out how many times a word, say “computer”, is repeated in the text.

  10. @shahab, sorry, this is not a forum where you can ask questions like this. I suggest taking this problem to a forum like http://www.codeguru.com/forum.

  11. Hi there,

    Thanks for the code, it’s exactly what I was looking for.
    Could you just help me out with one more thing? How can I read UTF-8 files?

  12. […] set up some form of automatic conversion that hooks into the C++ streams library. For example, see Writing UTF-8 files in C++ by Marius Bancila. This is information I’m going to keep in mind, but my testing with GCC 4.5 […]

  13. quandaso @ 2011-12-15 15:49

    Thanks, very useful!

  14. would be nice to see a linux compatible article

  15. Hmm, there are two things I don’t like about this:

    1. Casting away a const and then writing to the underlying buffer is undefined behavior. As of C++11 you could use &newbuffer[0], which gives you a non-const pointer and is designed for this purpose.

    2. Storing UTF-8 in the buffer of a string is surely asking for trouble… What if someone later attempts to use one of string’s algorithms on the buffer? Will it work? Maybe if the UTF-8 characters all happen to be narrow, probably not if there are any wide ones in there. This is also basically undefined behavior, and the only thing you could safely do with that std::string is immediately write it out.

  16. hamed ahmad @ 2012-04-09 13:54

    Thanks, a great example, just what I needed.
    I had some problems with appending to a wstring, but there is always a workaround.

  17. OK, about the portable variant. It is easy if you use the C++11 standard (because there are a lot of new includes like “utf8.h” to do this). But if you want to create multiplatform code with older standards, you can use this method (as I did) to write with streams:
    1. Read the article about UTF converter for streams from this link (http://www.codeproject.com/Articles/38242/Reading-UTF-8-with-C-streams)

    2. Add “stxutif.h” to your project from sources above

    3. First of all, open the file in ANSI mode and add a BOM at the start of the file, like this:
    std::ofstream fs;
    fs.open(filepath, std::ios::out|std::ios::binary);
    unsigned char smarker[3];
    smarker[0] = 0xEF;
    smarker[1] = 0xBB;
    smarker[2] = 0xBF;
    fs.write(reinterpret_cast<const char*>(smarker), 3); // actually write the BOM bytes
    fs.close();

    4. Then open the file as UTF and write your content there:
    std::wofstream fs;
    std::locale utf8_locale(std::locale(), new utf8cvt);
    fs.imbue(utf8_locale); // make the stream use the UTF-8 conversion facet
    fs.open(filepath, std::ios::out|std::ios::app);

    fs<<..//write anything you wan…

  18. Marius, can you please tell me how to open and read a file that is in UTF-8 format?

  19. Very useful Marius, many thanks!
