Upload in chunks with a transformation

Oct 28, 2015 at 11:58 AM
Hi,

I'm working on an application that needs to upload very large files to a web server and apply a specific transformation (end-to-end encryption) to the data before it is sent to the server. Because the files are very large, I want to avoid:

1) Reading the whole file into memory before encrypting.
2) Writing an encrypted temporary file to disk and uploading that.

What I really want to do is read the file from disk in chunks, apply the transformation to each chunk, upload the transformed chunk, and then move on to the next one.

I've investigated a number of different ways of doing this (producer-consumer buffer, subclassing streambuf, custom pipeline stage). What is the recommended approach? Any details would be very welcome.

Thanks,
Chris
Coordinator
Oct 29, 2015 at 7:14 AM
All three of the solutions you proposed seem very viable.

I think the custom pipeline stage will be the easiest to implement, but it has the obvious restriction of only working in the HTTP pipeline.
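
Roughly, a stage is just a subclass of http_pipeline_stage that overrides propagate() and hands the request on to the next stage; something like the sketch below (the class name and URL are made up, and the real work would happen before the forward):

```cpp
#include <cpprest/http_client.h>
#include <memory>

using namespace web::http;
using namespace web::http::client;

// Skeleton of a custom pipeline stage: it gets a chance to inspect or
// modify the outgoing request before forwarding it to the next stage.
class body_wrapping_stage : public http_pipeline_stage
{
public:
    pplx::task<http_response> propagate(http_request request) override
    {
        // Adjust `request` here (headers, body, etc.), then pass it on.
        return next_stage()->propagate(request);
    }
};

int main()
{
    http_client client(U("https://example.com"));
    client.add_handler(std::make_shared<body_wrapping_stage>());
    // Every client.request(...) now runs through body_wrapping_stage first.
    return 0;
}
```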
Oct 30, 2015 at 1:07 PM
Great, thanks a lot for this advice. Is the assumption correct, then, that the pipeline stage will be called each time the http_client reads a chunk from disk (using the file stream that's been passed in as the body)?

Also, how would this approach work for sending multiple files (in one HTTP request) with a different Content-Disposition for each file?

In general, it would be fantastic to get a tutorial on how to customise the concurrency streams and stream buffers. Something equivalent to http://www.mr-edd.co.uk/blog/beginners_guide_streambuf would be really helpful!

Thanks,
Chris
Oct 30, 2015 at 5:03 PM
So I did some experimentation, and it seems that the custom pipeline stage just gets called once at the beginning of the transfer, and the request it sees effectively just contains the body stream that I pass in. It's not clear to me how this can be used to transform parts of the data without reading the whole thing into memory, which is the same problem I face without a custom pipeline stage.

I feel like maybe I'm missing something!? :)
Coordinator
Oct 30, 2015 at 5:22 PM
Oh, dear. Yeah, you're right: the pipeline stage is only called once, so you won't be able to do the encryption directly there. It did seem a bit too simple to work ;).

I think you will need to subclass streambuf. If you just encrypted the bytes yourself and pushed them into a producer-consumer buffer, there's no good negative-feedback mechanism to ensure you only fill it up to a certain point. The streams are designed so that each copy of the "read head" gets its own counters for how much is available, so you can't actually tell how much the http_client has read just by querying the stream for how much it has "remaining".
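
To make that concrete, here is roughly the pattern I'm warning against (encrypt_chunk is a made-up placeholder for the real transformation). Note that putn just copies into the buffer and completes; nothing makes the producer wait for the network side:

```cpp
#include <cpprest/filestream.h>
#include <cpprest/containerstream.h>
#include <cpprest/producerconsumerstream.h>
#include <vector>

using namespace concurrency::streams;

// Placeholder for the real encryption; XOR is a stand-in only.
std::vector<uint8_t> encrypt_chunk(std::vector<uint8_t> plain)
{
    for (auto& b : plain) b ^= 0x5A;
    return plain;
}

void pump_file(const utility::string_t& path, producer_consumer_buffer<uint8_t> buf)
{
    auto file = file_stream<uint8_t>::open_istream(path).get();

    while (true)
    {
        // Read up to 64 KB of plaintext into a temporary buffer.
        container_buffer<std::vector<uint8_t>> chunk;
        size_t n = file.read(chunk, 64 * 1024).get();
        if (n == 0) break;

        std::vector<uint8_t> plain(chunk.collection().begin(),
                                   chunk.collection().begin() + n);
        auto cipher = encrypt_chunk(std::move(plain));

        // putn buffers the bytes and completes immediately; nothing here
        // waits for the http_client to drain the other end, so the buffer
        // keeps growing whenever the disk outruns the network.
        buf.putn(cipher.data(), cipher.size()).get();
    }

    buf.close(std::ios_base::out).get();
    file.close().get();
}
```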

For subclassing streambuf, it shouldn't actually be too bad; you just capture the other stream, and whenever you're "read from" you pass those calls down to your wrapped stream, transforming the bytes on the way through. For extra bonus usability points, you could use a custom pipeline stage to wrap the encrypting stream around all the message bodies automatically (if that meets your requirements), but it isn't needed.
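
To show the shape of that wrapper, here is a sketch written against the standard std::streambuf interface rather than the SDK's asynchronous streambuf (the XOR is a stand-in for the real encryption). The SDK version means implementing the analogous read-side virtuals on its own streambuf base class, but the forwarding idea is the same:

```cpp
#include <istream>
#include <streambuf>
#include <vector>
#include <cstddef>

// Stand-in for the real encryption; a real implementation would call into
// the crypto library here instead of XOR-ing.
static void transform_chunk(char* data, std::streamsize n)
{
    for (std::streamsize i = 0; i < n; ++i)
        data[i] ^= 0x5A;
}

// Read-only wrapper: pulls a chunk from the wrapped source streambuf on
// demand, transforms it, and serves it from an internal buffer. Memory use
// is bounded by the chunk size, not by the size of the underlying file.
class transforming_streambuf : public std::streambuf
{
public:
    explicit transforming_streambuf(std::streambuf* source,
                                    std::size_t chunk_size = 64 * 1024)
        : m_source(source), m_buffer(chunk_size) {}

protected:
    int_type underflow() override
    {
        if (gptr() < egptr())                        // data still buffered
            return traits_type::to_int_type(*gptr());

        std::streamsize n = m_source->sgetn(m_buffer.data(),
            static_cast<std::streamsize>(m_buffer.size()));
        if (n <= 0)
            return traits_type::eof();               // source exhausted

        transform_chunk(m_buffer.data(), n);
        setg(m_buffer.data(), m_buffer.data(), m_buffer.data() + n);
        return traits_type::to_int_type(*gptr());
    }

private:
    std::streambuf*   m_source;
    std::vector<char> m_buffer;
};
```

Usage would be something like wrapping an ifstream's rdbuf() and reading from an istream built on the wrapper; the asynchronous SDK version follows the same refill/transform/serve pattern in its read overrides, just returning tasks instead of blocking.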
Coordinator
Oct 30, 2015 at 5:24 PM
Just in case this wasn't clear: if you write a purely transformative wrapper around the "inner stream", then your memory usage is determined by what that inner stream is. If you're wrapping the file stream, for example, you should get minimal memory usage. Obviously, if you wrap a std::vector, you will still have the whole payload loaded at once :).
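
Hooking such a wrapper into the upload might look roughly like this; make_encrypting_istream is hypothetical and stands for the SDK-level wrapper discussed above, while the rest is the stock file_stream / http_client API:

```cpp
#include <cpprest/http_client.h>
#include <cpprest/filestream.h>

using namespace web::http;
using namespace web::http::client;
using namespace concurrency::streams;

// Hypothetical factory for the transforming wrapper: the returned istream
// reads from `inner`, encrypts chunk by chunk, and hands ciphertext out on
// demand.
istream make_encrypting_istream(istream inner);

// cipher_length is assumed to be the length of the *encrypted* body; if it
// isn't known up front, the request() overload without a content length
// falls back to chunked transfer encoding.
pplx::task<http_response> upload(const utility::string_t& path, size_t cipher_length)
{
    http_client client(U("https://example.com"));

    return file_stream<uint8_t>::open_istream(path)
        .then([=](istream file)
    {
        // Wrapping the file stream keeps memory bounded by the chunk size;
        // wrapping a container buffer would still hold the whole payload.
        istream body = make_encrypting_istream(file);
        return client.request(methods::PUT, U("/upload"),
                              body, cipher_length,
                              U("application/octet-stream"));
    });
}
```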
Oct 30, 2015 at 5:32 PM
Thanks for the clarification, that explains my confusion! :) Yeah, I realised the same thing with the producer-consumer buffer... you'd end up reading from disk much faster than you can send over the network, and end up with most of the file in memory.

Regarding the streambuf, could you provide a small snippet/example of how this can be done? I've been looking through the code and haven't been able to determine how to get this to work.

Thanks,
Chris