Async'ly Parsing the Data from the Web - Best Practices?

Jul 8, 2013 at 8:29 PM
Edited Jul 8, 2013 at 8:53 PM
Hi! :-)

I have some questions about using Casablanca in practice, I hope this is the right place to ask! :-)

Suppose I'm reading some data from the web -- specifically, the data is in the CSV (Comma Separated Values) format. The source is a web server -- as usual, the relevant variable is "response" of type "http_response", similar to the one in the following canonical example: http://casablanca.codeplex.com/wikipage?title=HTTP%20Client&referringTitle=Documentation

For performance reasons, I would like to start parsing (or tokenizing, to be precise) asynchronously -- as soon as I'm receiving (a row of) the data.

First, my "big" questions are:
  • What are the best practices here?
  • Are there any examples that would help me / guide me here (I'd also appreciate the applicable dos and don'ts)?
More details, some initial ideas (really at the thinking out loud stage for now), and more questions follow :-)

For more details of the CSV format, see RFC 4180: http://tools.ietf.org/html/rfc4180
However, for simplicity, feel free to also assume that:
  • at most one (the first, referred to as "the header") row will contain non-numerical entries (e.g., "variable_name_1",variable_name2)
  • the remaining rows (collectively referred to as "the body") will contain numerical entries (with the decimal mark always being a period, never a comma), with no missing values (i.e., the data is always shaped as a rectangular array -- never as a ragged array)
    // optionally, assume the first column (referred to as "the date") will contain a date string (e.g., a day in the ISO 8601 format: YYYY-MM-DD)
For instance, here's a possible input:
a,b,c
1.1,2.2,3.3
1,2,3
11,22,33.3
I will most likely use Boost.Tokenizer in the beginning: http://boost.org/libs/tokenizer/
// I just happen to be most familiar with it, having already used it in my previous projects.
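
For concreteness, here's roughly the per-record tokenizing I have in mind (just a sketch -- note that escaped_list_separator uses backslash escapes rather than RFC 4180's doubled quotes, but for the simplified input above that shouldn't matter):

#include <string>
#include <vector>
#include <boost/tokenizer.hpp>

// Split one CSV record into its fields (escaped_list_separator's defaults:
// separator ',', quote '"', escape '\').
std::vector<std::string> split_record(const std::string& record)
{
    boost::tokenizer<boost::escaped_list_separator<char>> tokens(record);
    return std::vector<std::string>(tokens.begin(), tokens.end());
}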

That being said, if it simplifies the analysis, feel free to assume I'm using std::strtok instead: http://en.cppreference.com/w/cpp/string/byte/strtok // I can just translate the explanations, I think :-)
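
In that case the per-record loop would presumably reduce to something like this (keeping in mind that strtok mutates its input and silently skips empty fields):

#include <cstdio>
#include <cstring>

// Print the fields of one CSV record; 'record' must be a writable,
// NUL-terminated buffer, since strtok overwrites the delimiters in place.
void print_fields(char* record)
{
    for (char* field = std::strtok(record, ",");
         field != nullptr;
         field = std::strtok(nullptr, ","))
    {
        std::printf("field: %s\n", field);
    }
}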

// Depending on the simplicity/performance trade-off I may later consider switching to Boost.Spirit: http://boost.org/libs/spirit/ // In particular, possibly using boost::spirit::qi::phrase_parse: http://www.boost.org/doc/libs/release/libs/spirit/doc/html/spirit/qi/reference/parse_api/iterator_api.html
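
// For the numerical rows of the body, I imagine the Spirit version would boil down to roughly this (just a sketch, not compiled yet):

#include <string>
#include <vector>
#include <boost/spirit/include/qi.hpp>

// Parse one numeric record, e.g. "1.1,2.2,3.3", directly into doubles.
bool parse_numeric_record(const std::string& record, std::vector<double>& out)
{
    namespace qi = boost::spirit::qi;
    auto first = record.begin();
    bool ok = qi::phrase_parse(first, record.end(),
                               qi::double_ % ',',            // comma-separated doubles
                               boost::spirit::ascii::space,  // skipper
                               out);
    return ok && first == record.end();
}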

Right now, I'm thinking of starting the parsing in the following code block:
.then([=,&fileBuffer](http_response response)
{
    return response.body().read_to_end(fileBuffer);
})
// this is a bit similar to the BingRequest example: http://casablanca.codeplex.com/SourceControl/latest#Release/collateral/Samples/BingRequest/bingrequest.cpp

// except that I'm allocating "fileBuffer" on the stack (which implies value semantics), instead of using make_shared (which would imply pointer/reference semantics).

Obviously, this should no longer be a "file_buffer".
I'm wondering, would a "basic_producer_consumer_buffer" be a good choice here?
// I admit I'm being vaguely inspired by the pattern in the PPL book here -- in particular, the Async Pipelines pattern: http://msdn.microsoft.com/en-us/library/gg663538.aspx#sec8 -- but perhaps this is overkill for this task?
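
If I went that route, I imagine the shape would be roughly the following (untested; the variable names, and the exact close() call, are just my guesses):

// One task pumps the response body into a shared producer_consumer_buffer;
// another reads records back out of the other end as they arrive.
concurrency::streams::producer_consumer_buffer<uint8_t> pipe;

// Producer side: copy the whole body into the buffer, then close the write
// end so the consumer eventually observes end-of-stream.
auto producer = response.body().read_to_end(pipe)
    .then([pipe](size_t) mutable { return pipe.close(std::ios_base::out); });

// Consumer side: an istream over the same buffer, to be read (e.g. record by
// record) from another task.
concurrency::streams::basic_istream<uint8_t> consumer = pipe.create_istream();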

Or perhaps a "basic_stdio_buffer" or a "basic_container_buffer" which I could then hand to, say, a boost::tokenizer object? // The input only needs to satisfy the ForwardIterator concept, so this shouldn't be a problem?

// Perhaps std::async with the std::launch::async policy would be the way to go -- i.e., launching a function that does the tokenizing work using a boost::tokenizer? (If so, a follow-up question: is it better to create a thread-local copy, or rather to have a single, say, static, instance? Since the data comes in sequentially anyway, I shouldn't face any race conditions in this case?)
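
Concretely, I'm picturing something along these lines (just a sketch; a fresh local tokenizer per call would also sidestep the shared-instance question entirely):

#include <future>
#include <string>
#include <vector>
#include <boost/tokenizer.hpp>

// Hand an already-received record to a worker thread for tokenizing, while
// the I/O continuation goes back to reading the next one. The record is
// copied into the closure, so there's nothing shared to race on.
std::future<std::vector<std::string>> tokenize_async(std::string record)
{
    return std::async(std::launch::async, [record]() -> std::vector<std::string>
    {
        boost::tokenizer<boost::escaped_list_separator<char>> tokens(record);
        return std::vector<std::string>(tokens.begin(), tokens.end());
    });
}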

// If I were planning to use "std::strtok", I suppose "rawptr_buffer" could also be an option, but ultimately I'm not.

Next, obviously I won't call the "read_to_end" member function, since I can only assume the "until reaching the end of the stream" part implies blocking (and thus defeats the purpose).
I know that I can also call the "extract_string" member function directly on the "response" object and thus obtain a task, but I'm not at all sure whether that's the way to go (can I even use the task thus obtained to read & tokenize partial data as it comes along -- or do I have to call "get", which, again, would be blocking)?
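
If I understand correctly, the extract_string route would look roughly as follows -- which is also why I suspect it doesn't buy me any incremental parsing:

// extract_string() yields a task for the *complete* body, so the continuation
// below only runs once everything has arrived -- no partial tokenizing along
// the way, although at least no blocking get() is needed either.
pplx::task<void> tokenize_whole_body(web::http::http_response response)
{
    return response.extract_string().then([](utility::string_t body)
    {
        // tokenize 'body' here, after the fact
    });
}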

As you can tell, I've got quite a lot of questions! :-)
As such, I'd appreciate any hints / suggestions / comments / ideas! :-)

Hey, perhaps parsing CSV data is lightweight enough that I should just do it synchronously? ;-)
// But that would just be too easy ;]
Jul 8, 2013 at 8:59 PM
Hi Matt,

Asynchronous parsing can be very frustrating. Ask yourself why you need it, and how asynchronous you need it to be. In the end, async I/O is all about taking advantage of the latency of I/O operations. In your case, you want to start parsing early, so you must be expecting large payloads to parse. Is that correct? If not, then just do it synchronously (your example above, reading to the end of the stream, is an example of how to get started with a synchronous parse).

(BTW, allocating the fileBuffer on the stack is dangerous -- please use a shared pointer instead!)
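
Roughly like this (untested, and using a container_buffer and a placeholder URL just for illustration) -- the point being that the continuations typically run long after the enclosing stack frame has returned, so anything captured by reference would dangle:

// Keep the buffer alive by capturing a shared_ptr by value in each continuation.
auto fileBuffer = std::make_shared<concurrency::streams::container_buffer<std::string>>();

web::http::client::http_client client(U("http://example.org"));   // placeholder

auto done = client.request(web::http::methods::GET)
.then([fileBuffer](web::http::http_response response)
{
    return response.body().read_to_end(*fileBuffer);
})
.then([fileBuffer](size_t bytesRead)
{
    const std::string& text = fileBuffer->collection();
    // parse 'text' here -- the buffer is still alive, no matter which thread
    // or how late these continuations happen to run
});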

Anyway, if you really do have large payloads, then CSV is particularly convenient, in that there are two delimiters you care about: the column delimiter (comma, for example), and the record delimiter (a.k.a. end-of-line). Casablanca already has the logic to read lines asynchronously, so a good compromise solution may be to use read_line() to get a record at a time, then use some other functionality to parse each record synchronously. That should probably give you almost all the asynchrony you need.
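
To give you an idea, the shape would be roughly as follows (off the top of my head, not compiled; adjust the end-of-stream test to taste):

// Pull one CSV record at a time with read_line(), parse it synchronously,
// and "loop" by chaining another read once the current record is handled.
pplx::task<void> parse_records(concurrency::streams::istream bodyStream)
{
    auto lineBuffer = std::make_shared<concurrency::streams::container_buffer<std::string>>();

    return bodyStream.read_line(*lineBuffer).then([=](size_t bytesRead) -> pplx::task<void>
    {
        if (bytesRead == 0 && bodyStream.is_eof())
            return pplx::task_from_result();   // nothing left to read

        const std::string& record = lineBuffer->collection();
        // ...tokenize 'record' synchronously here (Boost.Tokenizer, etc.)...

        return parse_records(bodyStream);      // schedule the next record
    });
}

You would then just call parse_records(response.body()) from the continuation where you receive the response.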

If that's not good enough, then I suggest taking a look at the read_nnn() functions in streams.h, which do some pretty elementary parsing. read_to_delim() may be the one that most closely matches your scenario.
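
For instance, to pull individual comma-separated fields (with bodyStream as in the previous sketch; you would still need read_line, or a second delimiter, to notice record boundaries):

auto fieldBuffer = std::make_shared<concurrency::streams::container_buffer<std::string>>();

// Read up to (and consuming) the next comma; the field text lands in the buffer.
bodyStream.read_to_delim(*fieldBuffer, ',').then([fieldBuffer](size_t bytesRead)
{
    const std::string& field = fieldBuffer->collection();
    // convert 'field' to a number, etc.
});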

Niklas
Jul 10, 2013 at 7:24 PM
Edited Jul 10, 2013 at 7:28 PM
Hi Niklas,

Thanks for your reply!

Yes, in fact I'm expecting a wide range of data sizes. As you suggest, I'm thinking of writing the sync version first, and then the async one (as simple as possible at first; I'll certainly look into read_line, thanks!) -- next, I'll experiment (profile) to see whether switching to async is worth it (and when; I expect this to depend on the no. of columns, since that's the main cost factor in the parser's workload; of course, the total cost will scale proportionally to the rows*cols product).

I think I'll start by using a container_buffer as in the "To access an HTTP response stream" section here:
http://msdn.microsoft.com/en-us/library/jj950083.aspx

Does it make sense?
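
i.e., roughly the following (adapting that snippet; untested):

// Read the whole body into a container_buffer, then tokenize it in one go
// (the synchronous first version).
concurrency::streams::container_buffer<std::string> inStringBuffer;
return response.body().read_to_end(inStringBuffer)
    .then([inStringBuffer](size_t bytesRead)
    {
        const std::string& csvText = inStringBuffer.collection();
        // tokenize csvText here
    });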

// On a side note, regarding the fileBuffer allocation -- interesting, may I ask why a shared_ptr is recommended? I'm not planning to propagate the object (or extend its lifetime) beyond the scope it's allocated in (at most I may make it static if it's worth it, but at the moment I'm not expecting a repeatedly-called-function workflow); if anything, I'd perhaps use a unique_ptr (though I don't think that's necessary here either). Shared ownership semantics (as implied by shared_ptr) don't seem to fit well here? The only thing I can think of is the potential for slicing, but I'm always passing (or capturing) by reference, so that shouldn't be a problem either (in fact, I can't even "accidentally" capture by value without getting a compiler diagnostic and/or explicitly adding mutability); am I missing something?

Another somewhat related question (this time about JSON); the following -- http://blogs.msdn.com/b/vcblog/archive/2013/02/26/the-c-rest-sdk-quot-casablanca-quot.aspx -- mentions that: "A JSON value can also be parsed from a stream using a constructor that takes a stream reference."
Is this parsing a part of the library? If so, is it done asynchronously?

Thanks again for your help!

Matt