I have some questions about using Casablanca in practice, I hope this is the right place to ask! :-)
Suppose I'm reading some data from the web -- specifically, the data is in the CSV (Comma-Separated Values) format. The source is a web server -- as usual, the relevant variable is "response" of type "http_response", similar to the one
in the following canonical example:
For performance reasons, I would like to start parsing (or tokenizing, to be precise) asynchronously -- as soon as I'm receiving (a row of) the data.
First, my "big" questions are:
- What are the best practices here?
- Are there any examples that would help me / guide me here (I'd also appreciate the applicable dos and don'ts)?
More details, some initial ideas (really at the thinking out loud stage for now), and more questions follow :-)
For more details of the CSV format, see RFC 4180:
However, for simplicity, feel free to also assume that:
- at most one row (the first, referred to as "the header") will contain non-numerical entries (e.g., "variable_name_1",variable_name_2)
- the remaining rows (collectively referred to as "the body") will contain numerical entries (with the decimal mark always being a period, never a comma), with no missing values (i.e., the data is always shaped as a rectangular array -- never as a ragged array)
// optionally, assume the first column (referred to as "the date") will contain a date string (e.g., a day in the ISO 8601 format: YYYY-MM-DD)
For instance, here's a possible input:
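For illustration, an input consistent with the assumptions above (the variable names and values are made up) might look like this:

```
date,price,volume
2013-08-01,100.5,1000
2013-08-02,101.25,1250
```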
I will most likely use Boost.Tokenizer in the beginning:
// I just happen to be most familiar with it, having already used it in my previous projects.
That being said, if it simplifies the analysis, feel free to assume the use of std::strtok instead:
// I can just translate the explanations, I think :-)
// Depending on the simplicity/performance trade-off I may later consider switching to Boost.Spirit:
// In particular, possibly using boost::spirit::qi::phrase_parse:
Right now, I'm thinking of starting the parsing in the following code block:
// this is a bit similar to the BingRequest example:
// except that I'm allocating "fileBuffer" on the stack (which implies value semantics), instead of using make_shared (which would imply pointer/reference semantics).
Obviously, this should no longer be a "file_buffer".
I'm wondering, would a "basic_producer_consumer_buffer" be a good choice here?
// I admit I'm being vaguely inspired by the pattern in the PPL book here -- in particular, the Async Pipelines pattern:
-- but perhaps this is overkill for this task?
Or perhaps a "basic_stdio_buffer" or a "basic_container_buffer", which I could then hand to, say, a boost::tokenizer object? // The input is only required to satisfy the ForwardIterator concept, so this shouldn't be a problem?
// Perhaps std::async with the std::launch::async policy would be the way to go -- i.e., to launch a function doing the work using a boost::tokenizer? (If so, a follow-up question: is it better to create a thread-local copy, or rather to have a single, say, static, instance? Since the data comes in a sequential manner anyway, I shouldn't face any race conditions in this case, should I?)
// If I were planning to use "std::strtok", I suppose "rawptr_buffer" could also be an option, but ultimately I'm not.
Next, obviously I won't call the "read_to_end" member function, since I can only assume the "until reaching the end of the stream" part implies blocking (and thus defeats the purpose).
I know that I can also call the "extract_string" member function directly on the "response" object and thus obtain a task, but I'm not at all sure whether that's the way to go (can I even use the task thus obtained to read & tokenize partial data as it comes along -- or do I have to call "get", which, again, would be blocking)?
As you can tell, I've got quite a lot of questions! :-)
As such, I'd appreciate any hints / suggestions / comments / ideas! :-)
Hey, perhaps parsing CSV data is lightweight enough that I should just do it synchronously? ;-)
// But that would just be too easy ;]