Strategy for transferring large amounts of data
I am writing my master's thesis and am in contact with a digital signage company; my topic is the distribution of large amounts of data. I need some ideas, or some documented experience, with transferring large amounts of data (images and video, roughly 100 MB to 1 GB, but any data would do; large datasets raise many of the same problems) to multiple clients.
Does anyone know of a method I could look into for approaching this in a structured manner, or can at least point me in a direction (other theses, books, papers, people)?
My main approach right now is to resolve a few things:
1. How can I make sure the data is intact when it arrives (not corrupted; the .png will still work)?
2. How can I determine whether I have received all the data?
3. ...?
Any input is welcome. The current approach is streaming via web services; I am going to look into a BitTorrent (P2P) approach, but that does not seem like a proper strategy, since each client can be showing different content.
Could any of you out there who work for a digital signage company tell me a bit about how you approach this? Or, if you have experience moving large datasets from server to client, what is your approach?
My experience is with moving very large datasets between machines - data streams of about 140 GB in a computational and transactional environment - so my comments may be overkill for your query. There were a few basic principles we had to incorporate in our applications:
1. Make the data streaming interruptible, so that if there is ever a problem with the transfer, the application only has to "rewind" a small percentage of the data. In my application we had to ensure that no problem would ever cause the loss of more than one hour's calculation. This implied interrupting the processing from within the program (in my case, every hour) to check that the logging and correction data were up to date and correct, in addition to continuous logging of the data and calculations.
2. Check correctness for each block of processing and any subdivisions of the data (your description implies there will be no subdivisions). This implied logging data about the transfer: the number of transactions, the number of bytes transferred, and a checksum on each block of data, logged separately from the other data (a small sketch of this, together with point 4, follows the list below).
3. Log the computations done on the data, both locally and remotely (this is what gave rise to the requirement to transfer the data). The data had to be transferred to a backup site which had to be robust to failure of the primary machine. This implied a remote geographical site, with its own power source on a different part of the national power grid, with independent dedicated communication lines and servers, etc.
4. Our solution required the remote site to acknowledge receipt of each data block, and of its separately calculated logging data, for comparison with the source. Obviously this checking could be done at either end, but we chose to do it on the recipient system.
These principles were necessary because of the volume of the data; the computational cost of doing all the calculations; and the sensitivity of the data.
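To make points 2 and 4 concrete, here is a minimal Python sketch (not our actual system) of sending a file in blocks with a per-block checksum and a receiver acknowledgement. The wire format, block size, and ACK/NAK bytes are assumptions made purely for illustration.

import hashlib
import socket
import struct

BLOCK_SIZE = 1 << 20             # 1 MiB blocks (assumed)
HEADER = struct.Struct("!I32s")  # block length + SHA-256 digest

def send_file(path: str, sock: socket.socket) -> None:
    """Send a file block by block; resend each block until the peer ACKs it."""
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                sock.sendall(HEADER.pack(0, b"\0" * 32))  # zero length marks EOF
                return
            digest = hashlib.sha256(block).digest()
            while True:
                sock.sendall(HEADER.pack(len(block), digest) + block)
                if sock.recv(1) == b"\x06":    # ACK: receiver verified the block
                    break                      # otherwise resend the same block

def recv_file(path: str, sock: socket.socket) -> None:
    """Receive blocks, verify each checksum, and acknowledge or request a resend."""
    with open(path, "wb") as f:
        while True:
            length, digest = HEADER.unpack(_recv_exact(sock, HEADER.size))
            if length == 0:
                return                         # EOF marker
            block = _recv_exact(sock, length)
            if hashlib.sha256(block).digest() == digest:
                f.write(block)
                sock.sendall(b"\x06")          # ACK
            else:
                sock.sendall(b"\x15")          # NAK: corrupted, please resend

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf

A real deployment would also persist the per-block log described in point 2, so that an interrupted transfer can resume from the last acknowledged block rather than starting over.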
The primary questions you have to answer before selecting a solution are:
How much does it matter if the data is corrupted or lost?
How frequently will the data be refreshed at the destination?
If the destination detects a problem, can it request a resend?
How replaceable is the data?
FTP
Here I will do the research for you...
http://en.wikipedia.org/wiki/File_Transfer_Protocol
another link:
http://en.wikipedia.org/wiki/Category:File_transfer_protocols
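If you go the FTP route, resumability comes from the REST command. A minimal Python sketch using the standard ftplib (host, credentials, and paths are placeholders):

import ftplib
import os

def resume_download(host: str, user: str, password: str,
                    remote_path: str, local_path: str) -> None:
    """Download remote_path over FTP, resuming from any partial local copy."""
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    with ftplib.FTP(host) as ftp, open(local_path, "ab") as out:
        ftp.login(user, password)
        ftp.voidcmd("TYPE I")                 # binary mode
        # rest=offset issues REST so the server starts sending mid-file
        ftp.retrbinary(f"RETR {remote_path}", out.write, rest=offset)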
Stanford has run programs in the past in which they used cluster computing, with machines around the world, to process huge amounts of data.
Generally they break the data into several independent parts and send them to computers around the world, where they are processed in the background. For this, the user needs to install a small piece of software, which processes the data and uploads the results to Stanford's servers. I don't remember the name of the project, but you can search the Stanford site; it should be one of their gene research labs.
My experience is at a company working on advertising on the tops of taxi cabs. While I didn't work on the ad-distribution portion of the system, I think most of the communication was done with XMPP and the ad distribution used BitTorrent. Being on a mobile system as we were, the network goes up and down a lot, so whatever you choose must not restart downloads from scratch on failure.
Use streams (buffered) - do not attempt to store the entire thing in memory. Break the data down and verify small parts with a checksum or hash strategy (if you're worried about data integrity, you might want a keyed hash).
Pretty simple to handle large amounts of data these days...
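As an example, here is a minimal Python sketch of that chunked, keyed-hash verification (the chunk size and key handling are assumptions for illustration only):

import hashlib
import hmac

CHUNK = 64 * 1024  # read 64 KiB at a time instead of loading the whole file

def keyed_digest(path: str, key: bytes) -> str:
    """Compute an HMAC-SHA256 over a file without holding it all in memory."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            mac.update(chunk)
    return mac.hexdigest()

The sender publishes keyed_digest(...) alongside the file; the client recomputes it after download and compares the two with hmac.compare_digest().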