Ideas for ra_serf pristine-downloading-optimisation to overcome SHA1 collisions

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Ideas for ra_serf pristine-downloading-optimisation to overcome SHA1 collisions

Johan Corveleyn-3
Not super-relevant in the short term (since it seems we are going in
the direction of always rejecting SHA-1 collisions to enter the
repository) but maybe useful in the future if we would ever support
sha1 collisions: on irc some ideas were floated for overcoming the
SHA1 dependence of ra_serf's pristine downloading optimisation.

Summarizing the discussion here on dev@ so the ideas don't get lost ...
(removed a cross-cutting discussion about the usefulness of only
disabling the pristine-downloading-optimisation while the working copy
can not yet handle collisions -- which ended in the suggestion to
simply always reject sha1 collisions on the server)

[[[
<stsp> jcorvel, a better fix would be to design a better mechanism
that allows the server to know what the client already has

<jcorvel> stsp: yes, but I wonder how that could be done. If the
client and server only exchange "partial information" (a hash), you're
always going to get this problem eventually I guess

<stsp> jcorvel, well, if the server had some view on the contents of
the pristine store, it could work

<danielsh> Perhaps a MAC could help here?

<stsp> or even small "do you have file@N" - "yes" / "no" responses,
rather than sending full content outright could help for large files

<danielsh> I was thinking, instead of using sha1, to use a MAC
(message authentication code)

<danielsh> a MAC is basically a keyed hash

<danielsh> The client and server could dynamically agree on a key,
then the uploadee would compute the MACs of what it has under the
ephemeral key ...

<danielsh> ("uploadee" being the client in a checkout operation, the
server in a commit operation)

<jcorvel> stsp: file@N as a "pristine-identifier" won't cut it I think
-- the feature is designed to avoid sending the same content even it's
a totally different file (e.g. pulling in a different branch into a
sparse wc of root-of-repos)

<stsp> yes the question is then which file@N do you have? :)

<danielsh> I was thinking the MAC's key could serve as salt to prevent
collisions... but this would only be efficient if the client didn't
have to MAC too many files, e.g., perhaps the client could MAC just
files that are part of the same node as what it's trying to download

<stsp> but i agree that mapping paths to content is hard in this case
without some extra key like a hash

<jcorvel> stsp: ah ok

<stsp> if the server knows that client has foo@N and wants to send
bar@M with same content, the server could send "use foo@N for bar@N"
for instance

<danielsh> either "part of the same node" (= share ancestry) or "same
basename" --- that's to handle cases like people versioning multiple
copies of openssl-1.0.tar.gz

<danielsh>  maybe the server could, when it's about to send foo@N,
tell the client what are all the bar@M / baz@O / qux@P pairs that have
the same sha1?

<stsp> or actually same content (based on the rep-cache, which is now safe)

<danielsh> The server should have this information basically for free,
if we remove the UNIQUE constraint on rep-cache.db

<stsp> oh yes we don't store SHA1-colliding content in the rep-cache yet

<stsp> that would need to be done as well

<stsp> anyway, it's just an idea

<jcorvel> it could be a long list of names the server needs to send.
For instance we make a release of 10+ products every week. Several
build along the way. Each are tagged. I.e. at any point fileX has
perhaps 1000 or 10.000 identical siblings somewhere in HEAD

<danielsh> jcorvel, the reporter tells the server what the client has;
the list could be trimmed to that part of the history?

<jcorvel> hm, okay, I guess

<jcorvel> I'm thinking maybe there is a way to buy us some time to
keep using this feature for another couple of years, by making
collision less likely?

<jcorvel> for instance by using the sha1 + md5 + whatever else
information we have for free

<danielsh> adding hashes doesn't help; using a longer hash is better

<danielsh> see the .pdf I posted to dev@ / private@

<danielsh> maybe salt, but I fear it's going to be have fixed per
repository or so, which negates most of the benefit

<jcorvel> danielsh: okay, it might not help forever, but I'm trying to
think of something we can already use with the information we have
stored on both client and server side (new hashes would require new wc
format -- costly 'svn upgrade' etc)

<danielsh> jcorvel, the shattered collision has only one or two blocks
differing...

<danielsh> md5 might help for the shattered[12].pdf case but in
general it's as easy to collide sha1+md5 simultaneously as to collide
only sha1
]]]

--
Johan