Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

Bert Huijben-5


> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]]
> Sent: maandag 5 september 2016 13:33
> To: [hidden email]
> Subject: svn commit: r1759233 -
> /subversion/trunk/subversion/libsvn_wc/questions.c
>
> Author: ivan
> Date: Mon Sep  5 11:32:54 2016
> New Revision: 1759233
>
> URL: http://svn.apache.org/viewvc?rev=1759233&view=rev
> Log:
> Use SHA-1 checksum to find whether files are actually modified in working
> copy if timestamps don't match.
>
> Before this change we were doing this:
> 1. Compare file timestamps: if they match, assume that files didn't change.
> 2. Open pristine file.
> 3. Read properties from wc.db and find whether translation is required.
> 4. Compare filesize with pristine filesize for files that do not
>    require translation. Assume that file is modified if the sizes differ.
> 5. Compare detranslated contents of working file with pristine.
>
> Now behavior is the following:
> 1. Compare file timestamps: if they match, assume that files didn't change.
> 3. Read properties from wc.db and find whether translation is required.
> 3. Compare filesize with pristine filesize for files that do not
>    require translation. Assume that file is modified if the sizes differ.
> 4. Calculate SHA-1 checksum of detranslated contents of working file
>    and compare it with pristine's checksum stored in wc.db.

We looked at this before, and this change has pro-s and con-s, depending on specific use cases.

With the compare to SHA we only have to read the new file, but we always have to read the file 100%.

With the older system we could bail on the first detected change.

If there is a change somewhere both systems read on average 100% of the filesize... only if there is no actual change except for the timestamp, the new system is less expensive.


If the file happens to be a database file or something similar there is quite commonly a change in the first 'block', when there are changes somewhere later on. (Checksum, change counter, etc.). File formats like sqlite were explicitly designed for this (and other cheap checks), with a change counter at the start.


I don't think we should 'just change behavior' here, if we don't have actual usage numbers for our users. Perhaps we should make this feature configurable... or depending on filesize.



We certainly want the new behavior for non-pristine working copies (on the IDEA list for years), but I'm not sure if we always want this behavior as only option.



This mail is partially, to just discuss this topic on the list, to make sure everybody knows what happened here and why.



        Bert

(Note that it is labor day in the USA today... so I don't expect many responses until later this week)

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

Markus Schaber
Hi,

From: Bert Huijben [mailto:[hidden email]]

> From: [hidden email] [mailto:[hidden email]]
> > Sent: maandag 5 september 2016 13:33
> > To: [hidden email]
> > Subject: svn commit: r1759233 -
> > /subversion/trunk/subversion/libsvn_wc/questions.c
> >
> > Author: ivan
> > Date: Mon Sep  5 11:32:54 2016
> > New Revision: 1759233
> >
> > URL: http://svn.apache.org/viewvc?rev=1759233&view=rev
> > Log:
> > Use SHA-1 checksum to find whether files are actually modified in
> > working copy if timestamps don't match.
> >
> > Before this change we were doing this:
> > 1. Compare file timestamps: if they match, assume that files didn't change.
> > 2. Open pristine file.
> > 3. Read properties from wc.db and find whether translation is required.
> > 4. Compare filesize with pristine filesize for files that do not
> >    require translation. Assume that file is modified if the sizes differ.
> > 5. Compare detranslated contents of working file with pristine.
> >
> > Now behavior is the following:
> > 1. Compare file timestamps: if they match, assume that files didn't change.
> > 3. Read properties from wc.db and find whether translation is required.
> > 3. Compare filesize with pristine filesize for files that do not
> >    require translation. Assume that file is modified if the sizes differ.
> > 4. Calculate SHA-1 checksum of detranslated contents of working file
> >    and compare it with pristine's checksum stored in wc.db.
>
> We looked at this before, and this change has pro-s and con-s, depending on
> specific use cases.
>
> With the compare to SHA we only have to read the new file, but we always have
> to read the file 100%.
>
> With the older system we could bail on the first detected change.
>
> If there is a change somewhere both systems read on average 100% of the
> filesize... only if there is no actual change except for the timestamp, the
> new system is less expensive.>
>
> If the file happens to be a database file or something similar there is quite
> commonly a change in the first 'block', when there are changes somewhere
> later on. (Checksum, change counter, etc.). File formats like sqlite were
> explicitly designed for this (and other cheap checks), with a change counter
> at the start.

Maybe we could cache a checksum of the first block (4k) of each file in the wc.db, to profit from those changes. Thus, we only need to read the whole file if the first block checksum is equal.

> I don't think we should 'just change behavior' here, if we don't have actual
> usage numbers for our users. Perhaps we should make this feature
> configurable... or depending on filesize.
>
> We certainly want the new behavior for non-pristine working copies (on the
> IDEA list for years), but I'm not sure if we always want this behavior as
> only option.
>
> This mail is partially, to just discuss this topic on the list, to make sure
> everybody knows what happened here and why.
>
>
>
> Bert
>
> (Note that it is labor day in the USA today... so I don't expect many
> responses until later this week)


Best regards

Markus Schaber

CODESYS® a trademark of 3S-Smart Software Solutions GmbH

Inspiring Automation Solutions

3S-Smart Software Solutions GmbH
Dipl.-Inf. Markus Schaber | Product Development Core Technology
Memminger Str. 151 | 87439 Kempten | Germany
Tel. +49-831-54031-979 | Fax +49-831-54031-50

E-Mail: [hidden email] | Web: http://www.codesys.com | CODESYS store: http://store.codesys.com
CODESYS forum: http://forum.codesys.com

Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915

This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received
this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorised copying, disclosure
or distribution of the material in this e-mail is strictly forbidden.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

Ivan Zhakov-2
In reply to this post by Bert Huijben-5
On 5 September 2016 at 14:46, Bert Huijben <[hidden email]> wrote:

>> -----Original Message-----
>> From: [hidden email] [mailto:[hidden email]]
>> Sent: maandag 5 september 2016 13:33
>> To: [hidden email]
>> Subject: svn commit: r1759233 -
>> /subversion/trunk/subversion/libsvn_wc/questions.c
>>
>> Author: ivan
>> Date: Mon Sep  5 11:32:54 2016
>> New Revision: 1759233
>>
>> URL: http://svn.apache.org/viewvc?rev=1759233&view=rev
>> Log:
>> Use SHA-1 checksum to find whether files are actually modified in working
>> copy if timestamps don't match.
>>
>> Before this change we were doing this:
>> 1. Compare file timestamps: if they match, assume that files didn't change.
>> 2. Open pristine file.
>> 3. Read properties from wc.db and find whether translation is required.
>> 4. Compare filesize with pristine filesize for files that do not
>>    require translation. Assume that file is modified if the sizes differ.
>> 5. Compare detranslated contents of working file with pristine.
>>
>> Now behavior is the following:
>> 1. Compare file timestamps: if they match, assume that files didn't change.
>> 3. Read properties from wc.db and find whether translation is required.
>> 3. Compare filesize with pristine filesize for files that do not
>>    require translation. Assume that file is modified if the sizes differ.
>> 4. Calculate SHA-1 checksum of detranslated contents of working file
>>    and compare it with pristine's checksum stored in wc.db.
>
Hi Bert,

> We looked at this before, and this change has pro-s and con-s, depending on specific use cases.
>
Thanks for bringing this to dev@ list, I was not aware that this topic
was discussed before.

> With the compare to SHA we only have to read the new file, but we
> always have to read the file 100%.
>
> With the older system we could bail on the first detected change.
>
I considered this trade off. See below.

> If there is a change somewhere both systems read on average
> 100% of the filesize... only if there is no actual change except
> for the timestamp, the new system is less expensive.
>
As far I understand the average characteristics are:
1. Files are equal:
   a) old behavior: 100% read of working file + 100% read of pristine
file = 200% of working file size.
   b) new behavior: 100% read of working file = 100% of working file size.

2. Files modified (but has the same size or require translation (!)):
   a) old behavior: 50% (average) read of working file + 50% (average)
read of pristine file = 100% of working file size.
   b) new behavior: 100% read of working file = 100% of working file size.

(Strictly speaking, average read size would also depend on the number
of modifications, and it could be less than 50%.)

Also libsvn_wc checks working file size for files that doesn't require
translation, before comparing contents. And keyword expansion/newline
translation doesn't make sense for binary files (like database, pdf,
docx).  And for most binary files format modification involves
changing its size.

(There were problem in old behavior because pristine file was opened
*before* comparing working file size. Fixing that would require
additional SQLite operation.)

> If the file happens to be a database file or something similar
> there is quite commonly a change in the first 'block', when
> there are changes somewhere later on. (Checksum, change
> counter, etc.). File formats like sqlite were explicitly designed
> for this (and other cheap checks), with a change counter at the start.

> I don't think we should 'just change behavior' here, if we don't
> have actual usage numbers for our users. Perhaps we should make
> this feature configurable... or depending on filesize.
>

Let me summarize all possible cases that I considered before my
change. First of all some definitions:
* Text file (T) -- text file that require translation, due to eol
style or keywords expansion
* Text file (N) -- text file that doesn't require translation
* Binary file -- some kind of binary file (database, pdf, zip, docx).
Let's assume that user doesn't configure svn:eol-style and
svn:keywords for them.
* WS -- size of working file
* PS -- size of pristine file

* Old=xxx -- average required read size for old behavior in terms of
working and pristine file sizes
* New=xxx -- average required read size for new behavior in terms of
working and pristine file sizes

1. Text file (T), not modified:  Old = WS + PS, New = WS
2. Text file (N), not modified:  Old = WS + PS, New = WS
3. Binary file, not modified:  Old = WS + PS, New = WS
4. Text file (T), modified, same size:  Old = 0.5 * WS + 0.5 * PS, New = WS
5. Text file (N), modified, same size:  Old = 0.5 * WS + 0.5 * PS, New = WS
6. Binary file, modified, same size:  Old = 0.5 * WS + 0.5 * PS, New = WS
7. Text file (T), modified, different size:  Old = 0.5 * WS + 0.5 * PS, New = WS
8. Text file (N), modified, different size:  Old = 0, New = 0
9. Binary file, modified, different size:  Old = 0, New = 0

(There is some overhead for SHA1 calculation: SHA1 performance is
about 200-500 MB/s, but currently it's out of scope)

With all above the new behavior should be working better or the same
in all cases. I agree that 50% approximation may be incorrect for some
specific binary formats (case 6) like sqlite db.

> We certainly want the new behavior for non-pristine working copies
> (on the IDEA list for years), but I'm not sure if we always want this
> behavior as only option.
>
> This mail is partially, to just discuss this topic on the list, to make sure everybody knows what happened here and why.

--
Ivan Zhakov
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

Stefan Hett-2
On 9/5/2016 6:23 PM, Ivan Zhakov wrote:
> With all above the new behavior should be working better or the same
> in all cases. I agree that 50% approximation may be incorrect for some
> specific binary formats (case 6) like sqlite db.
To be fair, I'd argue that in case of binary file modifications the
approximation is quite off. Most binary formats (if not all) in our
repository differ in the first couple of bytes (if they were changed)
and therefore it's quite a significant difference whether we read the
full file contents of a single file (which might be >100MB) or just the
first few bytes of two files.

As Bert already suggested, I totally support the statement that it's
quite a common design pattern for binary formats to have some checksum,
time stamp, counter value, filesize record, etc. at the beginning of the
file contents which is likely to differ, if the file has changed. If you
then take the file sizes differences between text files and binary files
into account (aka: text files usually being quite small, while binary
files usually being quite large) it certainly has the potential to
matter quite much that there's a difference expected for the binary file
comparison case.

FWIW: Markus' idea to keep two SHA-1 checksums (one for the first 4k
block and another for the full file) sounds therefore as a reasonable
suggestion.

Last but not least the throughput of calculating the SHA-1 is also
restricted by the I/O throughput in practice. For working directories
I'd assume it's not too unlikely to still reside on some HDD (rather
than some faster cache or an SSD) so it'd be limited to around 20 MB/s
in practice. Given large binary files this might pose a significant
difference in certain (not uncommon) use-cases.

Don't get this wrong: IMHO I agree that the SHA-1 approach is superior
(especially on Windows machines since it will reduce the cases where two
files have to be opened - pointer: anti virus scanner impacts). I just
share Bert's opinion here that the approach should be a bit improved
especially in light of binary file support.

If it would be of any help, I could do some performance measurements
with the two approaches on our repository to get some real world numbers
to work with.

--
Regards,
Stefan Hett

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

Branko Čibej
On 05.09.2016 19:09, Stefan Hett wrote:

> On 9/5/2016 6:23 PM, Ivan Zhakov wrote:
>> With all above the new behavior should be working better or the same
>> in all cases. I agree that 50% approximation may be incorrect for some
>> specific binary formats (case 6) like sqlite db.
> To be fair, I'd argue that in case of binary file modifications the
> approximation is quite off. Most binary formats (if not all) in our
> repository differ in the first couple of bytes (if they were changed)
> and therefore it's quite a significant difference whether we read the
> full file contents of a single file (which might be >100MB) or just
> the first few bytes of two files.
>
> As Bert already suggested, I totally support the statement that it's
> quite a common design pattern for binary formats to have some
> checksum, time stamp, counter value, filesize record, etc. at the
> beginning of the file contents which is likely to differ, if the file
> has changed. If you then take the file sizes differences between text
> files and binary files into account (aka: text files usually being
> quite small, while binary files usually being quite large) it
> certainly has the potential to matter quite much that there's a
> difference expected for the binary file comparison case.
>
> FWIW: Markus' idea to keep two SHA-1 checksums (one for the first 4k
> block and another for the full file) sounds therefore as a reasonable
> suggestion.
>
> Last but not least the throughput of calculating the SHA-1 is also
> restricted by the I/O throughput in practice. For working directories
> I'd assume it's not too unlikely to still reside on some HDD (rather
> than some faster cache or an SSD) so it'd be limited to around 20 MB/s
> in practice. Given large binary files this might pose a significant
> difference in certain (not uncommon) use-cases.
>
> Don't get this wrong: IMHO I agree that the SHA-1 approach is superior
> (especially on Windows machines since it will reduce the cases where
> two files have to be opened - pointer: anti virus scanner impacts). I
> just share Bert's opinion here that the approach should be a bit
> improved especially in light of binary file support.
>
> If it would be of any help, I could do some performance measurements
> with the two approaches on our repository to get some real world
> numbers to work with.


We discussed the two approaches to death years ago and decided to keep
the behaviour prior to r1759233, partly because we had no evidence one
way or the other. Changing the behaviour without having performance
measurements in hand doesn't seem like a really smart move to me ... so
yes, every little bit of extra data would surely help.

BTW, I strongly suggest not using the Subversion tree as a test case,
it's mostly small text files and so not very representative.


-- Brane
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

Ivan Zhakov-2
In reply to this post by Ivan Zhakov-2
On 5 September 2016 at 19:23, Ivan Zhakov <[hidden email]> wrote:

> On 5 September 2016 at 14:46, Bert Huijben <[hidden email]> wrote:
>>> -----Original Message-----
>>> From: [hidden email] [mailto:[hidden email]]
>>> Sent: maandag 5 september 2016 13:33
>>> To: [hidden email]
>>> Subject: svn commit: r1759233 -
>>> /subversion/trunk/subversion/libsvn_wc/questions.c
>>>
>>> Author: ivan
>>> Date: Mon Sep  5 11:32:54 2016
>>> New Revision: 1759233
>>>
>>> URL: http://svn.apache.org/viewvc?rev=1759233&view=rev
>>> Log:
>>> Use SHA-1 checksum to find whether files are actually modified in working
>>> copy if timestamps don't match.
>>>
>>> Before this change we were doing this:
>>> 1. Compare file timestamps: if they match, assume that files didn't change.
>>> 2. Open pristine file.
>>> 3. Read properties from wc.db and find whether translation is required.
>>> 4. Compare filesize with pristine filesize for files that do not
>>>    require translation. Assume that file is modified if the sizes differ.
>>> 5. Compare detranslated contents of working file with pristine.
>>>
>>> Now behavior is the following:
>>> 1. Compare file timestamps: if they match, assume that files didn't change.
>>> 3. Read properties from wc.db and find whether translation is required.
>>> 3. Compare filesize with pristine filesize for files that do not
>>>    require translation. Assume that file is modified if the sizes differ.
>>> 4. Calculate SHA-1 checksum of detranslated contents of working file
>>>    and compare it with pristine's checksum stored in wc.db.
>>
> Hi Bert,
>
>> We looked at this before, and this change has pro-s and con-s, depending on specific use cases.
>>
> Thanks for bringing this to dev@ list, I was not aware that this topic
> was discussed before.
>
[...]

>> If the file happens to be a database file or something similar
>> there is quite commonly a change in the first 'block', when
>> there are changes somewhere later on. (Checksum, change
>> counter, etc.). File formats like sqlite were explicitly designed
>> for this (and other cheap checks), with a change counter at the start.
>
>> I don't think we should 'just change behavior' here, if we don't
>> have actual usage numbers for our users. Perhaps we should make
>> this feature configurable... or depending on filesize.
>>
>
> Let me summarize all possible cases that I considered before my
> change. First of all some definitions:
> * Text file (T) -- text file that require translation, due to eol
> style or keywords expansion
> * Text file (N) -- text file that doesn't require translation
> * Binary file -- some kind of binary file (database, pdf, zip, docx).
> Let's assume that user doesn't configure svn:eol-style and
> svn:keywords for them.
> * WS -- size of working file
> * PS -- size of pristine file
>
> * Old=xxx -- average required read size for old behavior in terms of
> working and pristine file sizes
> * New=xxx -- average required read size for new behavior in terms of
> working and pristine file sizes
>
> 1. Text file (T), not modified:  Old = WS + PS, New = WS
> 2. Text file (N), not modified:  Old = WS + PS, New = WS
> 3. Binary file, not modified:  Old = WS + PS, New = WS
> 4. Text file (T), modified, same size:  Old = 0.5 * WS + 0.5 * PS, New = WS
> 5. Text file (N), modified, same size:  Old = 0.5 * WS + 0.5 * PS, New = WS
> 6. Binary file, modified, same size:  Old = 0.5 * WS + 0.5 * PS, New = WS
> 7. Text file (T), modified, different size:  Old = 0.5 * WS + 0.5 * PS, New = WS
> 8. Text file (N), modified, different size:  Old = 0, New = 0
> 9. Binary file, modified, different size:  Old = 0, New = 0
>
Hi Bert,

I tested several different binary file formats for no-op/minimal change:
1. Microsoft Word (docx): change single character at the end of document:
   - filesize changes (case 9)
   - first change at offset 2,295 of 233,323
2. Microsoft Word (doc): change single character at the end of document:
   - filesize didn't change (case 6)
   - first change at offset 540 of 479,232
3. sqlite database: insert one row to wc.db (2.5mb)
   - filesize didn't change (case 6)
   - first change at offset 27
4. zip archive: change single character in one of many text files (43
mb uncompressed)
   - filesize changes (case 9)
   - first change at offset 7,182,933 of 10,352,080
5. pdf file: no-op change of 800kb file
   - filesize changes (case 9)
   - first change at offset 47 of 854,971
6. Photoshop image (psd): change one pixel in the middle
   - filesize changes (case 9)
   - first change at offset 32 of 69,615,507

With this in mind, I think we can improve the current approach so that
in would be better in all possible cases. We could do this:
1. Open pristine file
2. Read first 4-16kb from pristine and normalized working file
3. Compare them: if they are equal then close pristine file, calculate
SHA1 of normalized working file and compare it with checksum in wc.db.

This behavior would only apply for only for files larger than some
threshold (e.g 1mb) to make performance penalty for opening pristine
file negligible.

What do you think?

--
Ivan Zhakov
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

Bert Huijben-5
This code is still in trunk without any of the discussed improvements, so this change is currently part of 1.10.0-alpha1.

If we don't implement the improvements I think we should check if we want to revert to the 1.0-1.9 behavior before we really look at releasing 1.10.

See discussion below

    Bert

On Thu, Sep 8, 2016 at 5:42 PM, Ivan Zhakov <[hidden email]> wrote:
On 5 September 2016 at 19:23, Ivan Zhakov <[hidden email]> wrote:
> On 5 September 2016 at 14:46, Bert Huijben <[hidden email]> wrote:
>>> -----Original Message-----
>>> From: [hidden email] [mailto:[hidden email]]
>>> Sent: maandag 5 september 2016 13:33
>>> To: [hidden email]
>>> Subject: svn commit: r1759233 -
>>> /subversion/trunk/subversion/libsvn_wc/questions.c
>>>
>>> Author: ivan
>>> Date: Mon Sep  5 11:32:54 2016
>>> New Revision: 1759233
>>>
>>> URL: http://svn.apache.org/viewvc?rev=1759233&view=rev
>>> Log:
>>> Use SHA-1 checksum to find whether files are actually modified in working
>>> copy if timestamps don't match.
>>>
>>> Before this change we were doing this:
>>> 1. Compare file timestamps: if they match, assume that files didn't change.
>>> 2. Open pristine file.
>>> 3. Read properties from wc.db and find whether translation is required.
>>> 4. Compare filesize with pristine filesize for files that do not
>>>    require translation. Assume that file is modified if the sizes differ.
>>> 5. Compare detranslated contents of working file with pristine.
>>>
>>> Now behavior is the following:
>>> 1. Compare file timestamps: if they match, assume that files didn't change.
>>> 3. Read properties from wc.db and find whether translation is required.
>>> 3. Compare filesize with pristine filesize for files that do not
>>>    require translation. Assume that file is modified if the sizes differ.
>>> 4. Calculate SHA-1 checksum of detranslated contents of working file
>>>    and compare it with pristine's checksum stored in wc.db.
>>
> Hi Bert,
>
>> We looked at this before, and this change has pro-s and con-s, depending on specific use cases.
>>
> Thanks for bringing this to dev@ list, I was not aware that this topic
> was discussed before.
>
[...]

>> If the file happens to be a database file or something similar
>> there is quite commonly a change in the first 'block', when
>> there are changes somewhere later on. (Checksum, change
>> counter, etc.). File formats like sqlite were explicitly designed
>> for this (and other cheap checks), with a change counter at the start.
>
>> I don't think we should 'just change behavior' here, if we don't
>> have actual usage numbers for our users. Perhaps we should make
>> this feature configurable... or depending on filesize.
>>
>
> Let me summarize all possible cases that I considered before my
> change. First of all some definitions:
> * Text file (T) -- text file that require translation, due to eol
> style or keywords expansion
> * Text file (N) -- text file that doesn't require translation
> * Binary file -- some kind of binary file (database, pdf, zip, docx).
> Let's assume that user doesn't configure svn:eol-style and
> svn:keywords for them.
> * WS -- size of working file
> * PS -- size of pristine file
>
> * Old=xxx -- average required read size for old behavior in terms of
> working and pristine file sizes
> * New=xxx -- average required read size for new behavior in terms of
> working and pristine file sizes
>
> 1. Text file (T), not modified:  Old = WS + PS, New = WS
> 2. Text file (N), not modified:  Old = WS + PS, New = WS
> 3. Binary file, not modified:  Old = WS + PS, New = WS
> 4. Text file (T), modified, same size:  Old = 0.5 * WS + 0.5 * PS, New = WS
> 5. Text file (N), modified, same size:  Old = 0.5 * WS + 0.5 * PS, New = WS
> 6. Binary file, modified, same size:  Old = 0.5 * WS + 0.5 * PS, New = WS
> 7. Text file (T), modified, different size:  Old = 0.5 * WS + 0.5 * PS, New = WS
> 8. Text file (N), modified, different size:  Old = 0, New = 0
> 9. Binary file, modified, different size:  Old = 0, New = 0
>
Hi Bert,

I tested several different binary file formats for no-op/minimal change:
1. Microsoft Word (docx): change single character at the end of document:
   - filesize changes (case 9)
   - first change at offset 2,295 of 233,323
2. Microsoft Word (doc): change single character at the end of document:
   - filesize didn't change (case 6)
   - first change at offset 540 of 479,232
3. sqlite database: insert one row to wc.db (2.5mb)
   - filesize didn't change (case 6)
   - first change at offset 27
4. zip archive: change single character in one of many text files (43
mb uncompressed)
   - filesize changes (case 9)
   - first change at offset 7,182,933 of 10,352,080
5. pdf file: no-op change of 800kb file
   - filesize changes (case 9)
   - first change at offset 47 of 854,971
6. Photoshop image (psd): change one pixel in the middle
   - filesize changes (case 9)
   - first change at offset 32 of 69,615,507

With this in mind, I think we can improve the current approach so that
in would be better in all possible cases. We could do this:
1. Open pristine file
2. Read first 4-16kb from pristine and normalized working file
3. Compare them: if they are equal then close pristine file, calculate
SHA1 of normalized working file and compare it with checksum in wc.db.

This behavior would only apply for only for files larger than some
threshold (e.g 1mb) to make performance penalty for opening pristine
file negligible.

What do you think?

--
Ivan Zhakov

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

Stefan Sperling
On Mon, Feb 20, 2017 at 09:05:25AM +0100, Bert Huijben wrote:
> This code is still in trunk without any of the discussed improvements, so
> this change is currently part of 1.10.0-alpha1.
>
> If we don't implement the improvements I think we should check if we want
> to revert to the 1.0-1.9 behavior before we really look at releasing 1.10.
>
> See discussion below
>
>     Bert

I think the proposed approach as implemented on trunk can no longer be
considered viable, unfortunately, because of this step:

> > >>> 4. Calculate SHA-1 checksum of detranslated contents of working file
> > >>>    and compare it with pristine's checksum stored in wc.db.

Given that the SHA1 collision problem is real, we are now trying to stop
relying on hashes to compare content. So it does not make sense to add
new code which relies on hashes in this way, in my opinion.

It seems that using SHA1 to compare content is key to the proposed approach.
If that is correct, then I don't agree with releasing 1.10 with this feature
and I would be in favour of reverting this change.

Ivan, do you have any further comments on this thread? You have remained
silent for quite some time now :(
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

Daniel Shahaf-2
Stefan Sperling wrote on Tue, Apr 04, 2017 at 11:33:18 +0200:

> On Mon, Feb 20, 2017 at 09:05:25AM +0100, Bert Huijben wrote:
> > This code is still in trunk without any of the discussed improvements, so
> > this change is currently part of 1.10.0-alpha1.
> >
> > If we don't implement the improvements I think we should check if we want
> > to revert to the 1.0-1.9 behavior before we really look at releasing 1.10.
> >
> > See discussion below
> >
> >     Bert
>
> I think the proposed approach as implemented on trunk can no longer be
> considered viable, unfortunately, because of this step:
>
> > > >>> 4. Calculate SHA-1 checksum of detranslated contents of working file
> > > >>>    and compare it with pristine's checksum stored in wc.db.
>
> Given that the SHA1 collision problem is real, we are now trying to stop
> relying on hashes to compare content. So it does not make sense to add
> new code which relies on hashes in this way, in my opinion.

Note that we can still do what rep-cache now does: treat equality of
checksums as a sensitive but not specific test for bit-for-bit equality;
that is: checksum the file, and if the resulting value is equal to the
stored value, do a full-text comparison too.

I'm not saying that this is what libsvn_wc should do; I'm just pointing
out that it is still possible to use sha1 safely.

Cheers,

Daniel

> It seems that using SHA1 to compare content is key to the proposed approach.
> If that is correct, then I don't agree with releasing 1.10 with this feature
> and I would be in favour of reverting this change.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

Stefan Sperling
On Tue, Apr 04, 2017 at 10:00:31AM +0000, Daniel Shahaf wrote:
> Note that we can still do what rep-cache now does: treat equality of
> checksums as a sensitive but not specific test for bit-for-bit equality;
> that is: checksum the file, and if the resulting value is equal to the
> stored value, do a full-text comparison too.

Yeah, that sounds OK.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

Markus Schaber
Hi,

> -----Original Message-----
> From: Stefan Sperling [mailto:[hidden email]]
> Sent: Tuesday, April 04, 2017 12:19 PM
> To: Daniel Shahaf
> Cc: [hidden email]
> Subject: Re: Check SHA vs Content (was: RE: svn commit: r1759233 -
> /subversion/trunk/subversion/libsvn_wc/questions.c)
>
> On Tue, Apr 04, 2017 at 10:00:31AM +0000, Daniel Shahaf wrote:
> > Note that we can still do what rep-cache now does: treat equality of
> > checksums as a sensitive but not specific test for bit-for-bit
> > equality; that is: checksum the file, and if the resulting value is
> > equal to the stored value, do a full-text comparison too.
>
> Yeah, that sounds OK.

INHO, it's also still sensible to use SHA1 (or even MD5) as a "technical" checksum to protect against problems like transmission errors, storage failure, random bit flips or accidental changing of pristine copies.

It's just that we must not use SHA1 as a sole means of identifying content or content identity.


Best regards

Markus Schaber

CODESYS® a trademark of 3S-Smart Software Solutions GmbH

Inspiring Automation Solutions

3S-Smart Software Solutions GmbH
Dipl.-Inf. Markus Schaber | Product Development Core Technology
Memminger Str. 151 | 87439 Kempten | Germany
Tel. +49-831-54031-979 | Fax +49-831-54031-50

E-Mail: [hidden email] | Web: http://www.codesys.com | CODESYS store: http://store.codesys.com
CODESYS forum: http://forum.codesys.com

Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915

This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received
this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorised copying, disclosure
or distribution of the material in this e-mail is strictly forbidden.


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

Johan Corveleyn-3
In reply to this post by Stefan Sperling
On Tue, Apr 4, 2017 at 11:33 AM, Stefan Sperling <[hidden email]> wrote:

> On Mon, Feb 20, 2017 at 09:05:25AM +0100, Bert Huijben wrote:
>> This code is still in trunk without any of the discussed improvements, so
>> this change is currently part of 1.10.0-alpha1.
>>
>> If we don't implement the improvements I think we should check if we want
>> to revert to the 1.0-1.9 behavior before we really look at releasing 1.10.
>>
>> See discussion below
>>
>>     Bert
>
> I think the proposed approach as implemented on trunk can no longer be
> considered viable, unfortunately, because of this step:
>
>> > >>> 4. Calculate SHA-1 checksum of detranslated contents of working file
>> > >>>    and compare it with pristine's checksum stored in wc.db.
>
> Given that the SHA1 collision problem is real, we are now trying to stop
> relying on hashes to compare content. So it does not make sense to add
> new code which relies on hashes in this way, in my opinion.
>
> It seems that using SHA1 to compare content is key to the proposed approach.
> If that is correct, then I don't agree with releasing 1.10 with this feature
> and I would be in favour of reverting this change.
>
> Ivan, do you have any further comments on this thread? You have remained
> silent for quite some time now :(

Where are we with this? Seems the consensus is to revert r1759233 to
not further increase our reliance on sha1? Or is there still a way to
keep r1759233 in some way, and improve it to make the sha1 test
"sensitive but not specific", like danielsh proposed?

Ivan?

--
Johan
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

Johan Corveleyn-3
On Tue, May 9, 2017 at 1:21 PM, Johan Corveleyn <[hidden email]> wrote:

> On Tue, Apr 4, 2017 at 11:33 AM, Stefan Sperling <[hidden email]> wrote:
>> On Mon, Feb 20, 2017 at 09:05:25AM +0100, Bert Huijben wrote:
>>> This code is still in trunk without any of the discussed improvements, so
>>> this change is currently part of 1.10.0-alpha1.
>>>
>>> If we don't implement the improvements I think we should check if we want
>>> to revert to the 1.0-1.9 behavior before we really look at releasing 1.10.
>>>
>>> See discussion below
>>>
>>>     Bert
>>
>> I think the proposed approach as implemented on trunk can no longer be
>> considered viable, unfortunately, because of this step:
>>
>>> > >>> 4. Calculate SHA-1 checksum of detranslated contents of working file
>>> > >>>    and compare it with pristine's checksum stored in wc.db.
>>
>> Given that the SHA1 collision problem is real, we are now trying to stop
>> relying on hashes to compare content. So it does not make sense to add
>> new code which relies on hashes in this way, in my opinion.
>>
>> It seems that using SHA1 to compare content is key to the proposed approach.
>> If that is correct, then I don't agree with releasing 1.10 with this feature
>> and I would be in favour of reverting this change.
>>
>> Ivan, do you have any further comments on this thread? You have remained
>> silent for quite some time now :(
>
> Where are we with this? Seems the consensus is to revert r1759233 to
> not further increase our reliance on sha1? Or is there still a way to
> keep r1759233 in some way, and improve it to make the sha1 test
> "sensitive but not specific", like danielsh proposed?
>
> Ivan?

Hmmm, I'm wondering, now we decided to disallow sha1 collisions to
enter the repository, how does this reflect on this discussion?

Okay, it's another "collision-sensitive-dependency" on sha1, but there
are others in the working copy (pristines) and ra_serf, and since
we're assuming a non-sha1-collision world (by blocking them from the
repository), it makes little sense to me to simply revert this, and
not fix other sha1-dependencies. OTOH, if we want to gradually reduce
our dependence on sha1, this is an easy one to remove, since it's not
in production yet ... Dunno.

However, there were also other, performance-related, objections to the
new implementation. Ivan proposed improvements [1] but they were never
implemented.

So, what to do?

Unless someone objects in the coming couple of days, I'd say we revert r1759233.

[1] https://svn.haxx.se/dev/archive-2016-09/0016.shtml

--
Johan
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Check SHA vs Content

Branko Čibej
On 22.05.2017 22:56, Johan Corveleyn wrote:

> On Tue, May 9, 2017 at 1:21 PM, Johan Corveleyn <[hidden email]> wrote:
>> On Tue, Apr 4, 2017 at 11:33 AM, Stefan Sperling <[hidden email]> wrote:
>>> On Mon, Feb 20, 2017 at 09:05:25AM +0100, Bert Huijben wrote:
>>>> This code is still in trunk without any of the discussed improvements, so
>>>> this change is currently part of 1.10.0-alpha1.
>>>>
>>>> If we don't implement the improvements I think we should check if we want
>>>> to revert to the 1.0-1.9 behavior before we really look at releasing 1.10.
>>>>
>>>> See discussion below
>>>>
>>>>     Bert
>>> I think the proposed approach as implemented on trunk can no longer be
>>> considered viable, unfortunately, because of this step:
>>>
>>>>>>>> 4. Calculate SHA-1 checksum of detranslated contents of working file
>>>>>>>>    and compare it with pristine's checksum stored in wc.db.
>>> Given that the SHA1 collision problem is real, we are now trying to stop
>>> relying on hashes to compare content. So it does not make sense to add
>>> new code which relies on hashes in this way, in my opinion.
>>>
>>> It seems that using SHA1 to compare content is key to the proposed approach.
>>> If that is correct, then I don't agree with releasing 1.10 with this feature
>>> and I would be in favour of reverting this change.
>>>
>>> Ivan, do you have any further comments on this thread? You have remained
>>> silent for quite some time now :(
>> Where are we with this? Seems the consensus is to revert r1759233 to
>> not further increase our reliance on sha1? Or is there still a way to
>> keep r1759233 in some way, and improve it to make the sha1 test
>> "sensitive but not specific", like danielsh proposed?
>>
>> Ivan?
> Hmmm, I'm wondering, now we decided to disallow sha1 collisions to
> enter the repository, how does this reflect on this discussion?
>
> Okay, it's another "collision-sensitive-dependency" on sha1, but there
> are others in the working copy (pristines) and ra_serf, and since
> we're assuming a non-sha1-collision world (by blocking them from the
> repository), it makes little sense to me to simply revert this, and
> not fix other sha1-dependencies. OTOH, if we want to gradually reduce
> our dependence on sha1, this is an easy one to remove, since it's not
> in production yet ... Dunno.

The whole point of not allowing SHA1 collisions into the repository is
to protect the working copy. We can always remove such restrictions once
the working copy (and all protocols) can handle hash collisions. But
it's kind of hard to fix a working copy once it's been broken in this way.

-- Brane

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

Jacek Materna
In reply to this post by Johan Corveleyn-3
On Mon, May 22, 2017 at 10:56 PM, Johan Corveleyn <[hidden email]> wrote:

> On Tue, May 9, 2017 at 1:21 PM, Johan Corveleyn <[hidden email]> wrote:
>> On Tue, Apr 4, 2017 at 11:33 AM, Stefan Sperling <[hidden email]> wrote:
>>> On Mon, Feb 20, 2017 at 09:05:25AM +0100, Bert Huijben wrote:
>>>> This code is still in trunk without any of the discussed improvements, so
>>>> this change is currently part of 1.10.0-alpha1.
>>>>
>>>> If we don't implement the improvements I think we should check if we want
>>>> to revert to the 1.0-1.9 behavior before we really look at releasing 1.10.
>>>>
>>>> See discussion below
>>>>
>>>>     Bert
>>>
>>> I think the proposed approach as implemented on trunk can no longer be
>>> considered viable, unfortunately, because of this step:
>>>
>>>> > >>> 4. Calculate SHA-1 checksum of detranslated contents of working file
>>>> > >>>    and compare it with pristine's checksum stored in wc.db.
>>>
>>> Given that the SHA1 collision problem is real, we are now trying to stop
>>> relying on hashes to compare content. So it does not make sense to add
>>> new code which relies on hashes in this way, in my opinion.
>>>
>>> It seems that using SHA1 to compare content is key to the proposed approach.
>>> If that is correct, then I don't agree with releasing 1.10 with this feature
>>> and I would be in favour of reverting this change.
>>>
>>> Ivan, do you have any further comments on this thread? You have remained
>>> silent for quite some time now :(
>>
>> Where are we with this? Seems the consensus is to revert r1759233 to
>> not further increase our reliance on sha1? Or is there still a way to
>> keep r1759233 in some way, and improve it to make the sha1 test
>> "sensitive but not specific", like danielsh proposed?
>>
>> Ivan?
>
> Hmmm, I'm wondering, now we decided to disallow sha1 collisions to
> enter the repository, how does this reflect on this discussion?
>
> Okay, it's another "collision-sensitive-dependency" on sha1, but there
> are others in the working copy (pristines) and ra_serf, and since
> we're assuming a non-sha1-collision world (by blocking them from the
> repository), it makes little sense to me to simply revert this, and
> not fix other sha1-dependencies. OTOH, if we want to gradually reduce
> our dependence on sha1, this is an easy one to remove, since it's not
> in production yet ... Dunno.
>
> However, there were also other, performance-related, objections to the
> new implementation. Ivan proposed improvements [1] but they were never
> implemented.
>
> So, what to do?
>
> Unless someone objects in the coming couple of days, I'd say we revert r1759233.
>
> [1] https://svn.haxx.se/dev/archive-2016-09/0016.shtml
>
> --
> Johan

I agree in not adding more new logs to the fire. I also agree we
should attack the already-in-prod deps on SHA1 as well in the
background-
Loading...