Brittle deltas – a possible solution?

Deltarpm is brittle. When it works correctly, it’s brilliant. But, like a tightrope walker crossing Niagara Falls while balancing an egg on his head, all it takes is one slip and…*splat*.

At the beginning of the Fedora 15 release cycle, a new version of xz was pushed in which the defaults for compression level 3 were changed (as far as I can tell, to what used to be level 4). This doesn’t cause any problems for newly compressed data, but if you decompress an rpm whose payload was compressed using old level 3 (like makedeltarpm does) and then recompress it with new level 3 (like applydeltarpm does), the compressed files no longer match. *Splat*.
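
To make the failure concrete, here is a minimal sketch of the decompress-then-recompress round trip. It is not deltarpm code: Python's zlib stands in for xz, zlib levels 9 and 1 stand in for the "old" and "new" level 3, and the function name rebuild_matches is made up for illustration.

```python
import zlib


def rebuild_matches(payload: bytes, old_level: int, new_level: int) -> bool:
    """Return True if recompressing at new_level reproduces the old bytes."""
    # What shipped in the original package (compressed with the old settings).
    original = zlib.compress(payload, old_level)
    # The uncompressed payload, which is what a delta is computed against.
    unpacked = zlib.decompress(original)
    # What the client regenerates after applying the delta (new settings).
    rebuilt = zlib.compress(unpacked, new_level)
    # The data itself always survives; only the compressed bytes can differ.
    assert unpacked == payload
    return rebuilt == original


payload = b"deltarpm payload " * 4096
print(rebuild_matches(payload, 9, 9))  # True: same settings, same bytes
print(rebuild_matches(payload, 9, 1))  # False: same data, different bytes
```

Both compressed streams decompress to identical data, but only a byte-for-byte match of the compressed file counts as a successful rebuild, so the second case is the *splat*.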

I wrote about the root problem here over a year ago, but to summarize: almost no compression tool guarantees that every release will produce the same compressed output from the same uncompressed input.

Our fix for Fedora 15 was pretty simple: delete all of the old deltarpms in Rawhide. As long as users have the new xz before doing a yum update, all new deltarpms will work correctly. Yay.

The problem is that this is all still extremely fragile. Take Fedora bugs #524720, #548523, and #677578 for example. All three bugs have cropped up because of mistakes in handling changes in the compression format, and it’s all a bit ridiculous. Would anyone use gzip if an old version couldn’t decompress data compressed with a newer version?

A possible solution?

There is no simple solution. So what if we change the rules? Instead of trying to keep the compression algorithms static, what if we stored just enough information in the deltas to recompress using the exact same settings, whatever they are?

For gzip, this would mean recording things like each block size, dictionary, etc. For xz, it would mean recording the LZMA2 settings. The problem is that this information is different for each compression type and the functions to extract the needed information haven’t been included in any compression libraries (to my knowledge).
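
As a rough illustration of the record-and-replay idea, here is a sketch using Python's zlib. It assumes the settings are known when the data is first compressed; it does not extract them from an existing stream, which is the part that is still missing. The names DeflateSettings, compress_with and rebuild are hypothetical, and the replay is only byte-exact when the same zlib version does the recompressing. Surviving a changed library would need the finer-grained details mentioned above.

```python
import zlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DeflateSettings:
    """Hypothetical record of the knobs zlib exposes at compression time."""
    level: int = 6
    wbits: int = 15
    mem_level: int = 8
    strategy: int = zlib.Z_DEFAULT_STRATEGY


def compress_with(data: bytes, s: DeflateSettings) -> bytes:
    """Compress data using exactly the recorded settings."""
    c = zlib.compressobj(s.level, zlib.DEFLATED, s.wbits, s.mem_level, s.strategy)
    return c.compress(data) + c.flush()


def rebuild(data: bytes, metadata: dict) -> bytes:
    """Recompress using settings stored alongside the delta."""
    return compress_with(data, DeflateSettings(**metadata))


payload = b"some rpm payload " * 4096
used = DeflateSettings(level=9, mem_level=9)

original = compress_with(payload, used)   # the file as originally shipped
metadata = asdict(used)                   # the extra data a delta would carry

# Replaying the recorded settings reproduces the original bytes.
print(rebuild(zlib.decompress(original), metadata) == original)
# Falling back to the defaults does not.
print(compress_with(payload, DeflateSettings()) == original)
```

Here the extra metadata is four small integers, but that is only because zlib's public settings are so coarse; per-block information would be larger.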

However, if we could write these functions and get them into the upstream libraries, it would benefit all programs that try to generate deltas. Deltarpm would continue to work when compression algorithms change. Rsync could actually delta gzipped files, even if the “--rsyncable” switch hasn’t been used in gzip.

There are a couple of possible problems with this solution. First, I’m not sure how big the extra needed information is. It obviously differs for each compression format, but unless it’s at most 1/100th the size of the uncompressed file, storing the extra data in the deltarpm will probably not be worth the effort.

Second, no code has actually been written. In an open source world of “Show me the code”, this is obviously a major issue. I’d love to do a reference implementation for one of the simpler compression formats (like zlib), but just haven’t had the time yet.

Obviously, the best solution would be for the various upstreams to provide the necessary functions, as they understand both their algorithms and what information should be stored. However, most upstreams have enough on their plates without needing extra stuff thrown in from random blogs.

Another good solution would be for someone who is interested in deltas and compression to take on this project themselves. Any volunteers? 🙂

Broken eggs credit: Broken Eggs by kyle tsui. Used under CC BY-NC-ND


Comments

Andre Robatino
Wednesday, Feb 16, 2011

The problem is magnified with deltaisos, since each contains a large number of deltarpms, every one of which must rebuild properly. And one can make useful disos between ISOs which are far apart - for example, in Fedora, a diso from (N-1) Final to N Final is typically around half the size of the full ISO. But this makes it much more likely that a compression change will happen over this period - Final releases happen twice a year, so each such diso spans about six months, and if a compression change happens once every 2 years, about 1/4 of these disos are potentially unusable. That means they are probably too unreliable at this point to catch on, which is a shame, since someone who now downloads the full DVD every release could potentially cut this download in half.