Theoretical question

Dec 31, 2013 at 6:30 PM
Hi,

Since this is a cross-platform library, and on Linux all strings are UTF-8, why does this library use wchar_t on Windows? I mean, this library has minimal (if any?) interaction with the Windows GUI subsystem, so it could just go UTF-8 all the way, couldn't it?

I am just curious about this decision,

thanks,

G.
Jan 2, 2014 at 4:58 PM
Hi G,

Some of the underlying Windows APIs we work with directly use UTF-16 strings; however, that isn't the only reason we made this decision. We felt it was better to always present UTF-16 strings on Windows so that developers don't have to perform many conversions themselves.
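
For example, utility::string_t and the utility::conversions helpers make that boundary explicit on both platforms. A minimal sketch (the literal value here is just illustrative):
#include "cpprest/asyncrt_utils.h"
// utility::string_t is std::wstring (UTF-16) on Windows and
// std::string (UTF-8) on Linux, so this compiles on both.
utility::string_t platformText = U("hello");
// Explicit conversions for when a std::string is needed.
std::string utf8 = utility::conversions::to_utf8string(platformText);
utility::string_t roundTrip = utility::conversions::to_string_t(utf8);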

We are always looking for feedback. Have you found this to be a pain, or are there places where you want to get a UTF-8 string on Windows but are unable to and have to do the conversion yourself?

Thanks,
Steve
Jan 2, 2014 at 5:34 PM
Hi Steve,

In my case the issue is that I have really big structures in memory. Most of this data is composed of binary buffers, so TCHAR is not an issue, but I also have a lot of tags (field names, XML tags, etc.). Because of that, if I use TCHAR with the UNICODE macro on Windows, the size increases significantly.

The strategy I adopted is to keep everything that is text encoded as UTF-8 and do the conversions ONLY at the edge where I interact with the Windows GUI subsystem. Yes, at that point I have to convert everything coming from Windows controls to UTF-8, but once that is done, everything else, including a lot of field names, XML tags, etc., is just plain ASCII text (even for Asian languages, the field names in the app and the XML tags are plain ASCII; nobody cares except the developers :)). Streaming the data out to be sent over the Internet is also fast, because what needs to be UTF-8 is already converted. The added benefit is easier Linux integration, at least for the non-GUI parts.
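
To be concrete, the conversion at the edge is just a thin wrapper over the Win32 APIs. A minimal sketch (the helper name is my own):
#include <windows.h>
#include <string>
// Convert UTF-16 text coming from a Windows control into the UTF-8
// representation used everywhere else in the application.
std::string ToUtf8(const std::wstring& wide)
{
    if (wide.empty()) return std::string();
    const int bytes = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(),
                                          (int)wide.size(), nullptr, 0, nullptr, nullptr);
    std::string utf8(bytes, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(),
                        (int)wide.size(), &utf8[0], bytes, nullptr, nullptr);
    return utf8;
}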

Thanks,

G.
Jan 3, 2014 at 3:15 PM
Hi G,

Yes, I can understand how in this situation you would want to keep storing your data in UTF-8 to save space. In some areas of our API we do have support for single-byte character strings on Windows. In the HTTP library, the http_request class has a set_body(...) overload which takes a std::string. To receive an http_response body directly as a std::string, you can set the underlying stream that the response body is written to via the http_request::set_response_stream(...) method, using a container_buffer backed by a std::string. This is also the most efficient way to receive a response body, since it guarantees no copies at all, although it isn't the cleanest-looking code.

Here is a small snippet of what this would look like:
#include "cpprest/http_client.h"
#include "cpprest/container_stream.h"
using namespace concurrency::streams;
using namespace web::http;
...
http_request request(...);
// create a stream backed by a std::string
container_buffer<std::string> responseData;
request.set_response_body(responseData.create_ostream());
auto response = client.request(request).get();
// Wait for all response data to be received.
response.content_ready().wait();
// Get string out of stream without any copies
std::string data = std::move(responseData.collection());

Are there other areas of our library where you find yourself wishing we had support for single-byte character strings on Windows in particular?

Thanks,
Steve
Jan 3, 2014 at 3:33 PM
Hi Steve,
"This also is the most efficient way to receive a response body"

Yes, I understand! To recap: the data arrives from the Internet UTF-8 encoded most of the time (actually, probably ALL the time), and using this technique you ensure that I, as a user, get it without ANY extra conversions of any kind. This has a double advantage, because most of the time the code receiving this data will still be common between Windows and Linux. Later, where the data is actually consumed, it can be converted as required.

Thanks,

G.
Jan 3, 2014 at 7:26 PM
Yes, exactly: most strings across the Internet are going to be UTF-8. And just to reiterate, setting the response stream isn't as productive from a coding perspective, but it is efficient for the following reasons:
  1. The data will never be copied; as it comes in, it is written directly into the specified stream.
  2. No conversions will be made, avoiding further copies.
  3. If you have an idea of the size of the incoming data, you can make further optimizations, such as using std::string::reserve(...) to avoid repeated heap allocations (see the sketch below).
You will, however, have to check the Content-Type header yourself to decide how to treat the data, as opposed to the http_response::extract_string() method, which checks the Content-Type for you and handles several different charset encodings.
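
To make that concrete, here is a sketch extending the earlier snippet (the reserved size is just an assumed value):
container_buffer<std::string> responseData;
// Reserve up front if you can estimate the body size, avoiding
// repeated heap allocations while the data streams in.
responseData.collection().reserve(1024 * 1024);
request.set_response_stream(responseData.create_ostream());
auto response = client.request(request).get();
response.content_ready().wait();
// Unlike extract_string(), nothing checks the charset for you here;
// inspect the Content-Type header before treating the bytes as UTF-8.
const utility::string_t contentType = response.headers().content_type();
std::string body = std::move(responseData.collection());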

Let me know if you have any other questions or concerns.

Thanks,
Steve