Writing UTF-8 files in C++

Let’s say you need to write an XML file with this content:

<?xml version="1.0" encoding="UTF-8"?>
<root description="this is a naïve example">
</root>

How do we write that in C++?

At first glance, you could be tempted to write it like this:

#include <fstream>
#include <string>

int main()
{
	std::ofstream testFile;

	testFile.open("demo.xml", std::ios::out | std::ios::binary); 

	std::string text =
		"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
		"<root description=\"this is a naïve example\">\n</root>";

	testFile << text;

	testFile.close();

	return 0;
}

When you open the file in IE, for instance, surprise! It’s not rendered correctly.

So you could be tempted to say “let’s switch to wstring and wofstream”.

int main()
{
	std::wofstream testFile;

	testFile.open("demo.xml", std::ios::out | std::ios::binary); 

	std::wstring text =
		L"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
		L"<root description=\"this is a naïve example\">\n</root>";

	testFile << text;

	testFile.close();

	return 0;
}

And when you run it and open the file again, no change. So, where is the problem? Well, the problem is that neither ofstream nor wofstream converts the text to UTF-8 on its own. ofstream writes the bytes of the narrow string exactly as the compiler stored them (on Windows, typically in the system code page), and wofstream narrows each wide character through the stream's locale, which by default is not a UTF-8 one.

If you want the file to really be in UTF-8 format, you have to encode the output buffer in UTF-8. And to do that we can use WideCharToMultiByte(). This Windows API maps a wide character string to a new character string (which is not necessarily from a multibyte character set). The first argument indicates the code page. For UTF-8 we need to specify CP_UTF8.
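To get a feel for what “encoding in UTF-8” means at the byte level, here is a minimal, portable sketch that encodes a single Unicode code point by hand. It is not the method used in this post (that is WideCharToMultiByte()); encode_code_point is a hypothetical name, and the sketch assumes a valid code point, so it does not reject surrogates or values above U+10FFFF.

```cpp
#include <string>

// Hypothetical helper (not part of the original post): encode one
// Unicode code point as 1 to 4 UTF-8 bytes. Assumes cp is a valid
// code point; no validation is performed.
std::string encode_code_point(char32_t cp)
{
	std::string out;
	if (cp < 0x80) {
		// 7 bits: plain ASCII, one byte
		out += static_cast<char>(cp);
	} else if (cp < 0x800) {
		// up to 11 bits: 110xxxxx 10xxxxxx
		out += static_cast<char>(0xC0 | (cp >> 6));
		out += static_cast<char>(0x80 | (cp & 0x3F));
	} else if (cp < 0x10000) {
		// up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
		out += static_cast<char>(0xE0 | (cp >> 12));
		out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
		out += static_cast<char>(0x80 | (cp & 0x3F));
	} else {
		// up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		out += static_cast<char>(0xF0 | (cp >> 18));
		out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
		out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
		out += static_cast<char>(0x80 | (cp & 0x3F));
	}
	return out;
}
```

With this, the ï in “naïve” (U+00EF) becomes the two bytes 0xC3 0xAF, which is what a UTF-8 aware reader expects to find in the file.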

The following helper functions encode a std::wstring into a UTF-8 stream, wrapped into a std::string.

#include <windows.h>
#include <string>

std::string to_utf8(const wchar_t* buffer, int len)
{
	int nChars = ::WideCharToMultiByte(
		CP_UTF8,
		0,
		buffer,
		len,
		NULL,
		0,
		NULL,
		NULL);
	if (nChars == 0) return "";

	std::string newbuffer;
	newbuffer.resize(nChars);
	::WideCharToMultiByte(
		CP_UTF8,
		0,
		buffer,
		len,
		&newbuffer[0],	// writable pointer to the string's own storage
		nChars,
		NULL,
		NULL);

	return newbuffer;
}

std::string to_utf8(const std::wstring& str)
{
	return to_utf8(str.c_str(), (int)str.size());
}

With that in hand, all you have to do is make the following changes:

int main()
{
	std::ofstream testFile;

	testFile.open("demo.xml", std::ios::out | std::ios::binary); 

	std::wstring text =
		L"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
		L"<root description=\"this is a naïve example\">\n</root>";

	std::string outtext = to_utf8(text);

	testFile << outtext;

	testFile.close();

	return 0;
}

And now when you open the file, you get what you wanted in the first place.
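If you would rather check the result without relying on how a browser renders it, you can look at the raw bytes. The following is a minimal sketch under my own naming (hex_dump is a hypothetical helper, not something from the Windows API): it formats a byte string as space-separated hex pairs, so you can paste in the file's contents and look for the ï.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: render each byte as two uppercase hex digits,
// separated by single spaces.
std::string hex_dump(const std::string& bytes)
{
	static const char digits[] = "0123456789ABCDEF";
	std::string out;
	for (std::size_t i = 0; i < bytes.size(); ++i) {
		// Go through unsigned char so bytes >= 0x80 are not sign-extended.
		unsigned char b = static_cast<unsigned char>(bytes[i]);
		if (i != 0) out += ' ';
		out += digits[b >> 4];
		out += digits[b & 0x0F];
	}
	return out;
}
```

In a correctly encoded file, the ï in “naïve” shows up as C3 AF. A lone EF at that position means the character was written in a single-byte code page such as Windows-1252 instead.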

And that is all!

19 Replies to “Writing UTF-8 files in C++”

  1. If I have a wide-character string that contains UTF-8 characters and I want to write it to a file, how do I do that?

  2. If it’s already UTF-8 then write it just like you’d write any other string. Use ofstream. You can see the last example in my post. Or maybe I’m missing something and didn’t understand you well.

  3. You can also do this:

    wofstream file("test");
    file.imbue(locale("en_US.utf8")); // can throw
    file << L"this is a naïve example" << endl;

    Or if you know your default locale is UTF-8
    file.imbue(locale(""));

  4. locale("en_US.utf8"); doesn’t work in Visual Studio. For "English_United States.1252" you must use locale("English") but this does not set it to UTF-8.

    This method seems to work only on Linux.

    Mr. Marius’s example is the only working method for converting wchar_t to char UTF-8 on Windows/Visual Studio.

    And as far as the encoding goes… Notepad seems to automatically detect the encoding required to display UTF-8, as far as I tested.

  5. Thanks for this article, Marius. It was exactly what I needed to know.

    I sometimes wonder how I managed to do my job before the internet was around…

  6. I used this in order to create UTF-8 files:
    _wfopen(strFile, L"wt, ccs=UTF-8");

    Nothing else needed. 😀

  7. Sorin, if you’re using _wfopen() with ccs=UTF-8, you automatically add BOM chars to the beginning of the file. The BOM tends to create more disadvantages than advantages.
    For instance, if you open the same file again with the same parameters (just to change its content), you must be careful to remove the old BOM, because otherwise you end up with two BOMs. 😀
    I ran into this situation in the past in a bug of one of my colleagues, and it isn’t too pretty.
    I use the std::ofstream class and a conversion class that contains a conversion method similar to Marius’s sample.

  8. Hi, I need help, please.
    I want to write a text like:

    “I am a student. I’m studying computer science at school. I love programming. I want to be very good at it.”

    in a text file (.dat or .txt) and then read it in C++
    in order to find out how many times a word, say “computer”, is repeated in the text.
    How?

  9. Hi there,

    thanks for the code, it’s exactly what I was looking for.
    Could you just help me out with one more thing? How can I read utf-8 files?

  10. Hmm, there’s two things I don’t like about this:

    1. Casting away a const and then writing to the underlying buffer is undefined behavior. As of C++11 you could use &newbuffer[0] which gives you a non-const pointer and is designed for this purpose.

    2. Storing UTF-8 in the buffer of a string is surely asking for trouble… What if someone later attempts to use one of string’s algorithms on the buffer? Will it work? Maybe if the UTF-8 characters all happen to be narrow, probably not if there are any wide ones in there. This is also basically undefined behavior and the only thing you could safely do with that std::string is immediately write it out.

  11. Thanks, a great example, just what I needed.
    I had some problems with appending to wstring, but there is always a workaround.

  12. Ok, about the portable variant. It is easy if you use the C++11 standard (cuz there are a lot of new includes like “utf8.h” to do this). But if you want to create multiplatform code with older standards, you can use this method (like I did) to write with streams:
    1. Read the article about UTF converter for streams from this link (http://www.codeproject.com/Articles/38242/Reading-UTF-8-with-C-streams)

    2. Add “stxutif.h” to your project from sources above

    3. Open file in ANSI mode and add BOM to the start of a file first of all, like this:
    std::ofstream fs;
    fs.open(filepath, std::ios::out|std::ios::binary);
    const char smarker[3] = { '\xEF', '\xBB', '\xBF' }; // UTF-8 BOM
    fs.write(smarker, sizeof(smarker)); // exactly 3 bytes; operator<< would expect a null terminator
    fs.close();

    4. Then open file as UTF and write your content there:
    std::wofstream fs;
    fs.open(filepath, std::ios::out|std::ios::app);
    std::locale utf8_locale(std::locale(), new utf8cvt);
    fs.imbue(utf8_locale);

    fs<<..//write anything you wan…
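Several of the replies above ask about reading UTF-8 back in and about stray BOMs. As a minimal sketch, assuming the file is small enough to fit in memory, you could read the raw bytes and drop a leading BOM if there is one. The names read_all_bytes and strip_utf8_bom are hypothetical, and this is only one way to do it.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read a whole file as raw bytes.
// Returns an empty string if the file cannot be opened.
std::string read_all_bytes(const std::string& path)
{
	std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
	if (!in) return std::string();
	return std::string(std::istreambuf_iterator<char>(in),
	                   std::istreambuf_iterator<char>());
}

// Hypothetical helper: remove a leading UTF-8 BOM (EF BB BF) in place.
// Returns true if a BOM was found and removed.
bool strip_utf8_bom(std::string& bytes)
{
	static const char bom[] = "\xEF\xBB\xBF";
	if (bytes.compare(0, 3, bom) == 0) {
		bytes.erase(0, 3);
		return true;
	}
	return false;
}
```

The resulting std::string still holds UTF-8 bytes. On Windows, it could then be passed to MultiByteToWideChar() with CP_UTF8 to get wide characters back, mirroring what to_utf8() does in the other direction.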
