Using Git as a back-end for an update server, how to keep the repository small
Here's my use case. I have a desktop app that can download media content from my server on demand. Every week or so, new media is pushed to the server, or existing media is renamed or modified, and the clients send requests every day or so to check whether there are updates they should download.
To accurately and easily determine the new files the clients need, I was thinking of using Git on the server, and storing for each client the revision hash of the data it has downloaded. On every update request, I can then easily check with Git which files were added, deleted, renamed, etc. with something like git diff --name-status -C <clientRevision> HEAD, and then send only the needed updates.
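To make the idea concrete, here is a minimal Python sketch of how a server could turn the tab-separated output of git diff --name-status -C into a list of changes. The names Change and parse_name_status and the sample paths are hypothetical, and the sketch assumes paths without unusual characters (git quotes those unless -z is used).

```python
from typing import List, NamedTuple, Optional

class Change(NamedTuple):
    status: str               # "A", "M", "D", "R", "C", ...
    path: str                 # path the change applies to (new path for R/C)
    old_path: Optional[str]   # source path for renames/copies, else None

def parse_name_status(output: str) -> List[Change]:
    """Parse tab-separated `git diff --name-status -C` output."""
    changes = []
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        status = fields[0][0]  # drop the similarity score: "R100" -> "R"
        if status in ("R", "C"):
            # Renames and copies carry two paths: source, then destination.
            changes.append(Change(status, fields[2], fields[1]))
        else:
            changes.append(Change(status, fields[1], None))
    return changes

# Hypothetical sample output, one line per changed file.
sample = (
    "A\tnew/intro.ogg\n"
    "M\timg/logo.png\n"
    "D\told/theme.mp3\n"
    "R100\timg/a.jpg\timg/b.jpg\n"
)
changes = parse_name_status(sample)

# Only added, modified, or copied files need their bytes transferred;
# a pure rename (R100) can be replayed on the client without a download.
to_send = [c.path for c in changes if c.status in ("A", "M", "C")]
```

A rename with a similarity score below 100 also changed content, so a fuller version would keep the score and send the file in that case.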
My question is: obviously, I don't need to keep the whole binary history of my media on the server. I don't care what file X looked like two months ago; I just need to know whether it was changed in the meantime, or renamed from Y to X, for instance. Is it possible to use Git in a way such that I could get rid of the “binary history” of files while still keeping track of which files were modified, added, removed, and renamed? Or is there another obvious technological choice that I've overlooked for this kind of scenario?
(Yes, I'd love to use rsync for the whole thing; unfortunately, the only things I know about my clients are that they run on the JVM, may use port 80, and can write to the directory that should contain the needed media files, so rsync is not an option.)
See my comment for the real answer, but comments don't allow for proper formatting.
Here's a quick sketch of a crazy idea if you want to go with git.
I understand that you do have control over the client devices and that you can run git on those devices. You could consider creating a mirror tree of hashes (e.g. md5/sha1 hashes) of the original binary files. Git will then look at the "hash tree" to determine what's new; make sure to fetch the actual data before updating the git checkout. Like so:
/actual/somedir/imag1.jpg
/mirror/somedir/imag1.jpg <= contains md5 hash
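A rough Python sketch of how such a mirror tree could be generated follows. The function names file_sha1 and build_mirror are made up for illustration, SHA-1 is just one of the hashes mentioned above, and the .git guard assumes the mirror directory may itself be the git working tree.

```python
import hashlib
import tempfile
from pathlib import Path

def file_sha1(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media never has to fit in memory."""
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_mirror(actual_root, mirror_root):
    """Write one small hash file per media file, then drop stale hash files."""
    for src in sorted(actual_root.rglob("*")):
        if src.is_file():
            dst = mirror_root / src.relative_to(actual_root)
            dst.parent.mkdir(parents=True, exist_ok=True)
            dst.write_text(file_sha1(src) + "\n")
    for stale in sorted(mirror_root.rglob("*")):
        rel = stale.relative_to(mirror_root)
        if ".git" in rel.parts or not stale.is_file():
            continue  # never touch git's own bookkeeping or directories
        if not (actual_root / rel).is_file():
            stale.unlink()  # the media file was deleted or renamed away

# Demo on a throwaway directory layout matching the example above.
with tempfile.TemporaryDirectory() as tmp:
    actual, mirror = Path(tmp, "actual"), Path(tmp, "mirror")
    (actual / "somedir").mkdir(parents=True)
    (actual / "somedir" / "imag1.jpg").write_bytes(b"fake image bytes")
    (mirror / "somedir").mkdir(parents=True)
    (mirror / "somedir" / "gone.jpg").write_text("stale\n")  # leftover entry
    build_mirror(actual, mirror)
    mirrored = {p.relative_to(mirror).as_posix(): p.read_text().strip()
                for p in mirror.rglob("*") if p.is_file()}
```

After each run, committing the mirror directory would give git only tiny text files to track, so the repository stays small while renames and modifications remain visible in the history.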
Git is great, but not the right tool for the job. If you are uninterested in history, and have large binaries, git is just going to cause problems.
Instead, what I recommend is a small SQL database for meta-information and an on-disk directory to store the media files.
First, the on-disk media files: to allow corruption detection and support renaming without retransferring large media files, name the files by their SHA (or MD5 or almost any decent checksum algorithm). You can either link the "real" filename to the hash-named file or use a translation table (possibly from a DB, possibly not) to present the good name to the user.
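As an illustration, a hash-named store with an in-memory translation table might look like the sketch below. The helper store_media, the names dictionary, and the choice of SHA-256 are assumptions made for the example, not part of the recommendation itself.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def store_media(src, store_dir):
    """Copy src into store_dir under its SHA-256 hex digest; return the digest."""
    # read_bytes() keeps the sketch short; stream in chunks for large media.
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    target = store_dir / digest
    if not target.exists():  # identical content is stored only once
        shutil.copyfile(src, target)
    return digest

# Translation table: user-visible name -> content hash.
names = {}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    store = root / "store"
    store.mkdir()
    clip = root / "clip.bin"
    clip.write_bytes(b"media bytes")

    names["intro/clip.bin"] = store_media(clip, store)

    # A rename touches only the table; the stored bytes stay where they are,
    # so nothing large has to be transferred again.
    names["intro/welcome.bin"] = names.pop("intro/clip.bin")

    stored_files = sorted(p.name for p in store.iterdir())
```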
Second, the SQL database. Track a revision (sequence) number for each client in a table. Track the revision id at which each media file was last updated. Track the current name of each media file and the revision id at which that name was last added, renamed, or deleted (filename NULL for delete).
Using this, you can instantly tell exactly what media files need to be sent to the user:
select clientid,mediaid from tmedia join tclients on tmedia.revisionid > tclients.revisionid;
You can instantly tell exactly what new file mappings need to be sent:
select mediaid,filename,clientid from tmapping join tclients on tmapping.revisionid > tclients.revisionid;
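For anyone who wants to try the two queries, here is a small self-contained sketch using Python's sqlite3 module. The table and column names come from the queries above; the column types, the sample rows, and the added where clause that restricts the result to a single client are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table tclients (clientid integer primary key, revisionid integer not null);
    create table tmedia   (mediaid text primary key, revisionid integer not null);
    create table tmapping (mediaid text primary key, filename text, revisionid integer not null);
""")
# Client 1 last synced at revision 1, client 2 is fully up to date.
db.executemany("insert into tclients values (?, ?)", [(1, 1), (2, 3)])
db.executemany("insert into tmedia values (?, ?)",
               [("aaa", 1), ("bbb", 2), ("ccc", 3), ("ddd", 1)])
db.executemany("insert into tmapping values (?, ?, ?)", [
    ("aaa", "a.jpg", 1),
    ("bbb", "renamed-b.jpg", 3),  # renamed at revision 3
    ("ccc", "c.jpg", 3),          # added at revision 3
    ("ddd", None, 2),             # deleted at revision 2 (filename NULL)
])

def media_to_send(clientid):
    """Media whose content changed after the client's last revision."""
    rows = db.execute(
        "select mediaid from tmedia join tclients"
        " on tmedia.revisionid > tclients.revisionid"
        " where tclients.clientid = ? order by mediaid", (clientid,))
    return [row[0] for row in rows]

def mappings_to_send(clientid):
    """Name changes (including deletions) after the client's last revision."""
    rows = db.execute(
        "select mediaid, filename from tmapping join tclients"
        " on tmapping.revisionid > tclients.revisionid"
        " where tclients.clientid = ? order by mediaid", (clientid,))
    return rows.fetchall()

media_for_1 = media_to_send(1)
mappings_for_1 = mappings_to_send(1)
media_for_2 = media_to_send(2)
```

Once a client confirms that it has applied an update, setting its row in tclients to the current revision makes the next request return nothing until new revisions appear.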
If you ever suspect corruption (or periodically), you can validate the media on the client and server by computing the SHA and comparing it to the filename, and then looking it up in the mapping table (both client and server) and media table (server). Also, just send the latest mapping file (or a partition of the mapping file, or a checksum of the mapping file) to validate what is going on there. Simple, easy to understand, and easy to develop.
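The filename check described above could be sketched like this in Python, assuming the files in the store are named by their SHA-256 hex digest. The function name find_corrupt is hypothetical, and the lookup against the mapping and media tables is left out.

```python
import hashlib
import tempfile
from pathlib import Path

def find_corrupt(store_dir):
    """Return the names of stored files whose content no longer matches their name."""
    corrupt = []
    for path in sorted(store_dir.iterdir()):
        if not path.is_file():
            continue
        # read_bytes() keeps the sketch short; stream in chunks for large media.
        if hashlib.sha256(path.read_bytes()).hexdigest() != path.name:
            corrupt.append(path.name)
    return corrupt

# Demo: one intact file and one whose bytes were damaged after storing.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    good_name = hashlib.sha256(b"intact").hexdigest()
    (store / good_name).write_bytes(b"intact")
    bad_name = hashlib.sha256(b"original").hexdigest()
    (store / bad_name).write_bytes(b"truncated")
    corrupt = find_corrupt(store)
```

Any name reported here identifies a file that should be fetched again.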
Hello,
You have port 80 open from the client to a server you manage. I suppose you may use a client other than git on the client side.
Do not use git to pull data from the server. Try to use a plain HTTP client and the HTTP methods designed for this: HEAD to find out whether a file has changed, and if so, GET to fetch it. You can also give your server repository some layout: the client downloads the index file meant for that particular client, and then checks each file listed in that index. Take inspiration from Debian APT repositories (the diffs, signing of files, etc.) if that would work for your use case. WebDAV is another option for accessing the server, offering even more comfort. You do not mention authentication, which may be required. If the client speaks HTTP, you may also use a (caching) proxy.
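The index-file variant can be illustrated with a short Python sketch. It assumes a hypothetical index format with one "digest, whitespace, path" line per file (similar to sha1sum output) and shows only the comparison step; fetching the index and the files over HTTP is left out, and parse_index and plan_update are made-up names.

```python
def parse_index(text):
    """Parse 'digest  path' lines into a {path: digest} dictionary."""
    index = {}
    for line in text.splitlines():
        if line.strip():
            digest, path = line.split(None, 1)  # path may contain spaces
            index[path] = digest
    return index

def plan_update(local, remote):
    """Decide which paths to GET and which local files to remove."""
    download = sorted(p for p, digest in remote.items() if local.get(p) != digest)
    delete = sorted(p for p in local if p not in remote)
    return download, delete

# Hypothetical index contents; real digests would be full checksums.
local_text = (
    "aaa  img/logo.png\n"
    "bbb  audio/old.ogg\n"
)
remote_text = (
    "aaa  img/logo.png\n"
    "ccc  audio/theme.ogg\n"
    "ddd  video/intro.mp4\n"
)

download, delete = plan_update(parse_index(local_text), parse_index(remote_text))
```

Because the index is small, the client can fetch it on every check and issue GET requests only for the paths in download.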
You may keep your data, the tree presented by the HTTP server, in a git repository. Replace the binaries with small files containing hashes (as suggested by Klaas van Schelven), and you can even add other metadata: a change log, time stamps, authors of files, etc.