Half-encoded URI issues

Oct 15, 2013 at 3:36 PM
I have a URI that looks like http://server:port/link/to/page?param1=Value%20with%20spaces&param2={someguid}

When I try to use that URI to build a web::http::uri object:
  • the web::http::uri constructor throws an exception because { is not a valid query character;
  • the web::http::uri::encode_uri static function re-encodes the % characters of my existing escapes, so %20 becomes %2520 (this is controlled by the should_encode lambda that is passed to encode_impl), and therefore my URI is no longer the same.
Therefore: is there a way for me to construct a web::http::uri object from my initial string, or do I have to transform it myself first?
Coordinator
Oct 15, 2013 at 10:31 PM
Have you tried the http::uri::decode() API?
   auto decoded_uri = http::uri::decode(input_str);
   auto encoded_uri = http::uri::encode_uri(decoded_uri);
   auto uri1 = http::uri(encoded_uri);
Also, I would recommend using the http::uri_builder to construct a uri from different parts.
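For readers without the SDK at hand, the decode-then-encode round trip can be imitated with the standard library alone. The sketch below is not the SDK's implementation: hex_value, is_ascii_alnum, percent_decode and percent_encode_uri are hypothetical helpers, and the set of characters left untouched is assumed to be the unreserved and reserved sets of RFC 3986.

```cpp
#include <string>

// Hypothetical helpers; they imitate the round trip, they are not SDK code.

// Value of a hexadecimal digit, or -1 if c is not one.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

bool is_ascii_alnum(unsigned char c) {
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

// Replace every well-formed %XX escape by the byte it stands for.
std::string percent_decode(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()
            && hex_value(in[i + 1]) >= 0 && hex_value(in[i + 2]) >= 0) {
            out += static_cast<char>(hex_value(in[i + 1]) * 16 + hex_value(in[i + 2]));
            i += 2;  // skip the two hex digits just consumed
        } else {
            out += in[i];
        }
    }
    return out;
}

// Percent-encode every byte that is neither unreserved nor reserved (assumed sets).
std::string percent_encode_uri(const std::string& in) {
    static const std::string keep = "-._~:/?#[]@!$&'()*+,;=";
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : in) {
        if (is_ascii_alnum(c) || keep.find(static_cast<char>(c)) != std::string::npos) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}
```

With the URI from the first post, percent_encode_uri(percent_decode(input)) keeps the existing %20 escapes and turns { and } into %7B and %7D. One caveat: decoding the whole string first loses the difference between a literal & and an escaped %26, so this only suits inputs that contain no escaped delimiters.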
Oct 16, 2013 at 10:48 AM
Thank you, that works.
Nov 8, 2013 at 1:27 PM
Edited Nov 8, 2013 at 4:24 PM
Now I run into a second, very similar problem:

I have a URI that looks like http://server:port/link/to/page?param1=Value%20with%20spaces&param2=şőmẹthing_nõn_ASCİI.

When I try to use http::uri::decode on that string, I get an exception because, as the comment in the source says, "// encoded string has to be ASCII".

Here are two solutions I came up with:
  1. the portable one:
    CMyURI::CMyURI(const utility::string_t &user_provided_uri)
    : m_uri(nullptr) // <-- this is a unique_ptr<web::http::uri>
    {
        const char hexadecimal_digits[16] = {'0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'};
        char escape[3] = {'%','0','0'};
    
        std::string u8_provided_uri = utility::conversions::to_utf8string(user_provided_uri);
        for(int i = u8_provided_uri.size() - 1; i >= 0; --i)
        {
            // Backwards replace UTF8 non-ASCII byte sequences
            unsigned char c = u8_provided_uri[i];
            if(c >= 0x80 && c <= 0xff)
            {
                escape[1] = hexadecimal_digits[(c >> 4) & 0x0f];
                escape[2] = hexadecimal_digits[c & 0x0f];
                u8_provided_uri.replace(i, 1, escape, sizeof(escape)/sizeof(escape[0]));
            }
        }
    
        const utility::string_t safe_for_decode_uri = utility::conversions::to_string_t(u8_provided_uri);
        const utility::string_t decoded_uri = web::http::uri::decode(safe_for_decode_uri);
        m_uri.reset(new web::http::uri(web::http::uri::encode_uri(decoded_uri)));
    } // CMyURI
    
    
  2. the single-platform (Windows-only) one:
    #include "Wininet.h"
    #pragma comment(lib, "Wininet.lib")
    CMyURI::CMyURI(const utility::string_t &user_provided_uri)
    : m_uri(nullptr) // <-- this is a unique_ptr<web::http::uri>
    {
        DWORD buf_size = 1;
        TCHAR dummy_tchar;
        InternetCanonicalizeUrl(user_provided_uri.c_str(), &dummy_tchar, &buf_size, ICU_NO_META); // get buffer size
        std::unique_ptr<TCHAR[]> canonicalized_provided_uri (new TCHAR[buf_size]);
        if(!InternetCanonicalizeUrl(user_provided_uri.c_str(), canonicalized_provided_uri.get(), &buf_size, ICU_NO_META))
        {
            // handle error
        }
    
        const utility::string_t decoded_uri = web::http::uri::decode(canonicalized_provided_uri.get());
        m_uri.reset(new web::http::uri(web::http::uri::encode_uri(decoded_uri)));
    }
    
    
Are those solutions a good approach in your opinion? Thank you for your answer.


EDIT: Solution 2 does not work. The encoded URI must decode to a valid UTF-8 string, but InternetCanonicalizeUrlW will, for example, encode é as %E9 instead of %C3%A9.
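To make the %E9 versus %C3%A9 difference concrete, here is a minimal sketch that percent-encodes a single Unicode code point through its UTF-8 bytes. percent_encode_code_point is a hypothetical helper written only for this illustration; it escapes every byte, ASCII included, and does not validate its input.

```cpp
#include <string>

// Hypothetical helper: encode one code point as UTF-8, then escape each byte as %XX.
std::string percent_encode_code_point(char32_t cp) {
    std::string bytes;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        bytes += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        bytes += static_cast<char>(0xC0 | (cp >> 6));
        bytes += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        bytes += static_cast<char>(0xE0 | (cp >> 12));
        bytes += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        bytes += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        bytes += static_cast<char>(0xF0 | (cp >> 18));
        bytes += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        bytes += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        bytes += static_cast<char>(0x80 | (cp & 0x3F));
    }

    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        out += '%';
        out += hex[b >> 4];
        out += hex[b & 0x0F];
    }
    return out;
}
```

For é (U+00E9) this yields %C3%A9, the two-byte UTF-8 form. The single byte E9 is the Latin-1 code of the same character, which is not valid UTF-8 on its own.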
Coordinator
Nov 9, 2013 at 12:24 AM
Hi,

A partially encoded URI string is bad input. Once you have a string in this form (with both encoded and unencoded components together), there is no ideal programmatic "solution". I would encourage you to avoid getting into this state.

If you could get the URI components separately, we can build the URI even when some of the components are encoded and some are not. For instance:
    uri_builder ub(U("http://server:port/link/to/page"));
    ub.append_query(U("param1=Value%20with%20spaces"), false); // false => the string is already encoded 
    ub.append_query(U("param2=şőmẹthing_nõn_ASCİI"), true); // true => encode the specified query string
    auto uri1 = ub.to_uri();
Solution #1 actually encodes the non-ASCII characters first. It might work in this case, but I would be careful about relying on it heavily.
My suggestion is to see if you can get a valid input string, one that is either fully encoded or fully decoded.

Thanks
Kavya.
Nov 12, 2013 at 3:30 PM
Hello Kavya, and thank you for having taken the time to answer.

These URI strings may be ill-encoded, but unfortunately I don't have any control over them because they come from an external source. So even though I like your suggestion, it is not possible for me. However, it seems that the above UTF-8 conversion works in practice.

I have also encountered URIs in which the query, params or fragment parts contain reserved characters such as [. This generates another kind of error. To solve it, I found that the best way was to reuse the code from the inner_parse method in uri_parser.cpp, but instead of returning false when encountering unexpected characters in the query, params or fragment parts, to manually percent-encode those characters and then call the uri constructor.
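The idea can be sketched without copying inner_parse. The code below is an assumption-laden illustration, not the SDK's parser: repair_part and repair_query_and_fragment are hypothetical helpers, the allowed set is taken from the query and fragment rules of RFC 3986, % is kept on the assumption that it starts an escape, and path params are left out for brevity.

```cpp
#include <string>

// Hypothetical helper: percent-encode the bytes that RFC 3986 does not allow in a
// query or fragment. '%' is kept, assuming it starts an existing escape.
std::string repair_part(const std::string& part) {
    static const std::string allowed =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
        "-._~!$&'()*+,;=:@/?%";
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : part) {
        if (allowed.find(static_cast<char>(c)) != std::string::npos) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Hypothetical helper: leave scheme, authority and path alone; repair only the
// query (after the first '?') and the fragment (after the first '#').
std::string repair_query_and_fragment(const std::string& uri) {
    const std::string before_fragment = uri.substr(0, uri.find('#'));
    const std::size_t qmark = before_fragment.find('?');

    std::string result = before_fragment.substr(0, qmark);
    if (qmark != std::string::npos) {
        result += '?' + repair_part(before_fragment.substr(qmark + 1));
    }
    if (before_fragment.size() < uri.size()) {
        result += '#' + repair_part(uri.substr(before_fragment.size() + 1));
    }
    return result;
}
```

For example, a query such as ids[0]=1 comes out as ids%5B0%5D=1, while existing escapes like %20 pass through unchanged.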

I would really like it if the REST SDK included a method to parse such URIs and make a best-effort attempt to repair them, because doing that manually in my code gives me the feeling that my program is fragile. Do you think this feature is worth considering?

Thank you again for reading.
Benoit
Coordinator
Nov 15, 2013 at 1:29 AM
Hi Benoit_Mortgat

I am looking into your suggestion. Will get back to you soon on this one.

Thanks
Kavya
Marked as answer by Benoit_Mortgat on 11/22/2013 at 5:50 AM
Dec 10, 2013 at 4:38 PM
Hello Kavya,

May I ask you whether the suggestion has been studied?
Thank you

Benoit
Coordinator
Dec 11, 2013 at 12:14 AM
Hi Benoit_Mortgat,

This is still in our backlog of potential features to consider.

There are some challenges with implementing this. Consider an input URI string that has both encoded and unencoded components.

1. Regarding the first suggestion, percent-encoding only the non-ASCII characters will not be sufficient. What if the string also contains characters like '%'? We would have to percent-encode those too.
The new API would have to make certain assumptions: if the string has a '%' character followed by two hexadecimal digits, assume that it is already encoded; if not, go ahead and percent-encode the '%' character too.
This implies that if your input string is something like "http://localhost/encodedstr1%C5%9Fend/encodedstr2%xyend", the new API will end up not encoding the first path component "encodedstr1%C5%9Fend" and encoding the second one "encodedstr2%xyend", even though the first component could have been a valid unencoded string.

2. When the URI components have reserved characters, it is difficult to break the URI into its components. For instance, if the unencoded string has the path delimiter character '/' in it, we cannot tell whether the parser should treat it as the path delimiter or as a character that needs to be encoded.
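The assumption described in point 1 can be sketched as follows. This is only an illustration of the heuristic, not a proposed SDK API; encode_stray_percents is a hypothetical name.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: keep '%' when it is followed by two hexadecimal digits
// (assumed to be an existing escape), otherwise encode it as %25.
std::string encode_stray_percents(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const bool is_escape = in[i] == '%' && i + 2 < in.size()
            && std::isxdigit(static_cast<unsigned char>(in[i + 1]))
            && std::isxdigit(static_cast<unsigned char>(in[i + 2]));
        if (in[i] == '%' && !is_escape) {
            out += "%25";  // a lone '%' is treated as literal data
        } else {
            out += in[i];
        }
    }
    return out;
}
```

Applied to the example string, the first component is left as encodedstr1%C5%9Fend and the second becomes encodedstr2%25xyend, which is exactly the ambiguity described above: the function cannot know whether %C5 was meant literally.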

What do you think of the above two cases, can your application also run into such inputs?

Since this is not a mainline scenario but an extended feature, we may not be able to get to it by the next release.
I will definitely keep you posted if we make any progress here.

Thanks
Kavya
Dec 11, 2013 at 2:31 PM
@kavyako, is a URL starting with // (without http) valid in Casablanca? For instance, //blogs.msdn.com/vcblog is valid in IE et al. It seems the protocol is automagically resolved (or just defaults to http)!
Coordinator
Dec 11, 2013 at 9:23 PM
TheDeeds,

Currently, Casablanca does not resolve the URI scheme (to http) when the URL starts with //.
I looked into RFC 3986, and section 5.4 does mention that "//" should get resolved to the default scheme:
a relative reference is transformed to its target URI as follows:
"//g" = "http://g"
We will fix this in one of our upcoming releases.
Thank you for reporting the issue.
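As a hedged illustration of what such a resolution involves: the examples in RFC 3986 section 5.4 are resolved against the base URI http://a/b/c/d;p?q, so "//g" takes its scheme from that base. The sketch below therefore assumes a base URI is available; resolve_network_path is a hypothetical helper, not a Casablanca API, and it handles only references that start with "//".

```cpp
#include <string>

// Hypothetical helper: resolve a reference starting with "//" by inheriting the
// scheme of the base URI. Any other reference is returned unchanged.
std::string resolve_network_path(const std::string& base, const std::string& ref) {
    if (ref.compare(0, 2, "//") != 0) {
        return ref;  // not a network-path reference
    }
    const std::size_t colon = base.find(':');
    if (colon == std::string::npos) {
        return ref;  // the base has no scheme to inherit
    }
    return base.substr(0, colon + 1) + ref;  // "http:" + "//g"
}
```

With the RFC's base URI, "//g" becomes "http://g". With an https base, the same reference would resolve to an https URI.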

Thanks
Kavya
Dec 17, 2013 at 8:46 AM
Edited Dec 17, 2013 at 8:51 AM
Hello Kavya,

My application collects URIs while crawling a web site. I checked that every major web browser (IE11, Chrome, Firefox) can navigate the web site and accepts the bad URIs I reported previously. However, regarding your question 1, I noticed that IE11 rejects such URIs (see screenshot), whereas both Chrome and Firefox send GET /encodedstr1%C5%9Fend/encodedstr2%xyend HTTP/1.1, so their URI parsers do not fail (but I did not delve into the code to see how the URIs are parsed). Still, if IE11 rejects such URIs, they are probably rare in real life.

[Screenshot: IE11 error message]

Regarding your second question, could you give an example of a URI where / would be ambiguous? I cannot see a case where I would consider encoding that character. Maybe ?, # or & would be more challenging.