I have a fat/uber JAR generated by the Gradle Shadow plugin. I often need to send the fat JAR over the network, so it would be convenient to send only the delta of the file instead of roughly 40 MB of data. rsync is a great tool for this purpose. However, a small change in my source code leads to a large change in the final fat JAR, and consequently rsync does not help as much as it could.
Can I convert the fat JAR to rsync-friendly JAR?
My ideas for a solution/workaround:
- Put the burden on rsync and somehow tell it that it is working with a compressed file (I didn't find any way to do this).
- Convert the non-rsyncable JAR to an rsyncable JAR (a sketch of this follows the list).
- Tell Gradle Shadow to generate an rsyncable JAR (not possible at the moment).
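For the second idea, a minimal repack sketch (the shadow JAR name app-all.jar is an assumption): unpack the fat JAR and re-store it without compression, so rsync's rolling checksum can match the unchanged regions.

ORIG=app-all.jar                     # assumed name of the shadow JAR
rm -rf repack && mkdir repack
unzip -q "$ORIG" -d repack
# -0 = store only (no compression), -X = omit extra file attributes
(cd repack && zip -q -r -0 -X "../${ORIG%.jar}-rsyncable.jar" .)
rm -rf repack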
3 Answers
Answer 1
As far as I know, rsyncable gzip works by resetting the Huffman tree and padding to a byte boundary every 8192 bytes of compressed data. This avoids long-range side effects in the compression (rsync takes care of shifted data blocks as long as they are at least byte-aligned).
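For reference, this reset behaviour is what gzip's own --rsyncable flag enables (available in gzip 1.7+ and in Debian-patched builds before that; the file name below is a placeholder):

# Periodic compressor resets: a local change in the input only
# perturbs a bounded region of the compressed output.
gzip --rsyncable -c application.tar > application.tar.gz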
In this sense, a JAR containing small files (less than 8192 bytes) is already rsyncable, because each file is compressed separately. As a test you could use jar's -0 option (no compression) to check whether it helps rsync, but I think it won't.
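Concretely, such a test could look like this (a sketch; the directory, JAR, and host names are placeholders):

# Rebuild the archive with stored (uncompressed) entries:
jar cf0 app-uncompressed.jar -C classes .
# Compare rsync's block-matching stats for the two variants:
rsync -av --stats app-uncompressed.jar user@host:/tmp/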
To improve the rsyncability you need to (at least):
- Make sure the files are stored in the same order.
- Make sure the metadata associated with unchanged files is also unchanged, as each file has a local file header. For example, the last modification time is problematic for .class files.
I am not sure about jar, but zip allows extra fields, some of which may prevent rsync matches, e.g. the last access time in the unix extension.
Edit: I did some tests with the following commands:
FILENAME=SomeJar.jar
rm -rf tempdir
mkdir tempdir
unzip ${FILENAME} -d tempdir/
cd tempdir
# set the timestamp to 2000-01-01 00:00
find . -print0 | xargs --null touch -t 200001010000
# normalize file mode bits, maybe not necessary
chmod -R u=rwX,go=rX .
# sort and zip files, without extra fields (-X)
find . -type f -print | sort | zip ../${FILENAME}_normalized -X -@
cd ..
rm -rf tempdir
rsync stats when the first file contained in the jar/zip is removed:

total: matches=1973 hash_hits=13362 false_alarms=0 data=357859
sent 365,918 bytes  received 12,919 bytes  252,558.00 bytes/sec
total size is 4,572,187  speedup is 12.07

When the first file is removed and every timestamp is modified:

total: matches=334 hash_hits=124326 false_alarms=4 data=3858763
sent 3,861,473 bytes  received 12,919 bytes  7,748,784.00 bytes/sec
total size is 4,572,187  speedup is 1.18
So there is a significant difference, but not as much as I expected.
It also seems that changing the file mode does not impact the transfer (maybe because it is stored in the central directory?).
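One way to dig into this is zipinfo from Info-ZIP: in verbose mode it dumps each entry's metadata, including Unix attributes and any extra fields, so you can compare what actually differs between two archives (the JAR name is the same placeholder as above):

# Dump per-entry metadata (mode bits, timestamps, extra fields):
zipinfo -v SomeJar.jar | less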
Answer 2
Let's take one step back; if you do not create large jars, this ceases to be a problem.
So, if you deploy your dependency jars separately, and you don't jar them into a single fat jar, you've also solved the problem here.
To do that, let's say you have:
- /foo/yourapp.jar
- /foo/lib/guava.jar
- /foo/lib/h2.jar
Then, put the following entry in the META-INF/MANIFEST.MF file of yourapp.jar:
Class-Path: lib/guava.jar lib/h2.jar
Now you can just run java -jar yourapp.jar and it will work, picking up the dependencies. You can then transfer these files individually with rsync; yourapp.jar will be much smaller, and your dependency JARs will usually not have changed, so those won't take much time to rsync either.
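If you build the JAR by hand rather than with a build tool, one way to get that entry in is jar's m option, which merges a manifest fragment into META-INF/MANIFEST.MF (a sketch; the file and directory names are assumptions):

# The manifest fragment must end with a newline:
printf 'Class-Path: lib/guava.jar lib/h2.jar\n' > manifest.txt
# c = create, f = output jar, m = manifest fragment to merge in:
jar cfm yourapp.jar manifest.txt -C classes .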
I'm aware this doesn't directly answer the question as asked, but I bet that in 90%+ of the cases where this question comes up, not building a fat JAR is the appropriate answer.
NB: Ant, Maven, Gradle, etc. can take care of putting the right manifest entry in. If the intent of your JAR is not to run it but, for example, it's a WAR for a web servlet container, those have their own rules for how to specify where your dependency JARs live.
Answer 3
Yes, you can speed it up by about 40% on a new archive and by more than 200% on a JAR archive you've already rsync'd. The trick is not to compress the JAR, so you can take advantage of rsync's chunking algorithm.
I used the following commands to compress a directory with a lot of class files...
jar cf0 uncompressed.jar .
jar cf compressed.jar .
This created the following two jars...
-rw-r--r-- 1 rsync jar 28331212 Apr 13 14:11 ./compressed.jar
-rw-r--r-- 1 rsync jar 38746054 Apr 13 14:10 ./uncompressed.jar
Note that the uncompressed JAR is about 10 MB larger.
I then rsync'd these files and timed the transfers using the following commands. (Note that even turning on rsync's compression (-z) for the compressed JAR had little effect; I'll explain why later.)
Compressed Jar
time rsync -av -e ssh compressed.jar jar@rsync-server.org:/tmp/
building file list ... done
compressed.jar

sent 28334806 bytes  received 42 bytes  2982615.58 bytes/sec
total size is 28331212  speedup is 1.00

real    0m9.208s
user    0m0.248s
sys     0m0.483s
Uncompressed Jar
time rsync -avz -e ssh uncompressed.jar jar@rsync-server.org:/tmp/
building file list ... done
uncompressed.jar

sent 11751973 bytes  received 42 bytes  2136730.00 bytes/sec
total size is 38746054  speedup is 3.30

real    0m5.145s
user    0m1.444s
sys     0m0.219s
We have gained a speedup of nearly 50%. That at least speeds up the initial rsync, but what about subsequent rsyncs, where only a small change has been made?
I then removed one class file (170 bytes in size) from the directory and recreated the jars. Now they are this size...
-rw-r--r-- 1 rsync jar 28330943 Apr 13 14:30 compressed.jar
-rw-r--r-- 1 rsync jar 38745784 Apr 13 14:30 uncompressed.jar
Now the timings are very different.
Compressed Jar
building file list ... done
compressed.jar

sent 12166657 bytes  received 31998 bytes  2217937.27 bytes/sec
total size is 28330943  speedup is 2.32

real    0m5.435s
user    0m0.378s
sys     0m0.335s
Uncompressed Jar
building file list ... done
uncompressed.jar

sent 220163 bytes  received 43624 bytes  175858.00 bytes/sec
total size is 38745784  speedup is 146.88

real    0m1.533s
user    0m0.363s
sys     0m0.047s
So we can speed up rsyncing of large JAR files a lot using this method. The reason is rooted in information theory: compression in effect removes everything that is redundant in the data, so what you're left with looks very much like random data (the best compressors remove more of this redundancy). A small change to the input therefore has a dramatic effect on the output of most compression algorithms.
The zip algorithm effectively makes it harder for rsync to find checksums that match between the server and the client, which means more data has to be transferred. If you leave the JAR uncompressed, you let rsync do what it is good at: sending less data to sync the two files.
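You can see this effect without rsync at all; here is a small demonstration with synthetic data (file names are arbitrary):

# Two inputs that differ in exactly one byte:
seq 1 100000 > a.txt
sed 's/^500$/501/' a.txt > b.txt
cmp -l a.txt b.txt | wc -l   # -> 1 differing byte
# After compression, the outputs diverge from the change onward:
gzip -c a.txt > a.gz
gzip -c b.txt > b.gz
cmp -l a.gz b.gz | wc -l     # -> most of the compressed stream differs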