extending the blame callback

classic Classic list List threaded Threaded
42 messages Options
123
Reply | Threaded
Open this post in threaded view
|

extending the blame callback

Stefan Kueng
Hi,

When running blame on an utf16 text file, the lines can not be used
since the blame callback only passes a 'char *' parameter which means it
ends at the first zero char. But actually, svn knows if the line has
more content.
So I'd like to propose the following patch which extends the blame
callback with a 'bytelength' parameter indicating how many bytes the
line consists of. That way, clients can themselves determine whether to
show the full line (e.g. using hex display) or maybe try to convert the
line string from different encodings.
Without the length parameter, clients can not exceed beyond the first
null char to determine what the line really consists of without risking
an access violation.

I've already tested this with TSVN and now I can properly blame utf16
files. While it's not straight forward to figure out how to get the real
uft16 line, it's possible. One problem is that every line except the
first has a zero char at the beginning (the leftover zero char after the
detected LF from the previous line), but I can work around that.

So, with this patch I'm able to do a blame on utf16 files, so I'd like
to commit this if there are no objections.

[[[
Extend the blame callback with a string length parameter.

* subversion/incluce/svn_client.h
* subversion/libsvn_client/blame.c
   (svn_client_blame_receiver4_t): typedef for new callback
   (svn_client_blame6): new API using the svn_client_blame_receiver4_t
callback
* subversion/libsvn_client/deprecated.c
   (svn_client_blame5): moved API there, calling svn_client_blame6 using a
                        callback shim
   (blame_wrapper_receiver3): callback shim for svn_client_blame5

]]]

blame6.patch (9K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Julian Foad-5
Stefan Kueng wrote:
> When running blame on an utf16 text file, the lines can not be used
> since the blame callback only passes a 'char *' parameter which means it
> ends at the first zero char. But actually, svn knows if the line has
> more content.

+1 on doing something to fix the problem.

Just briefly:
* we already have a (char*, len) wrapper, called svn_string_t, so I would expect it would be neatest to use that;
* splitting the lines on a bare LF byte, and having a left-over zero byte, sounds just so wrong... it can't be right, can it? Can't we do something better? Use a callback for deciding how to split lines, or something? I haven't looked at the code, so not sure what exactly.

(I can't review the patch in more detail just now.)

--
- Julian
Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Mark Phippard-3
In reply to this post by Stefan Kueng

> On Jan 5, 2019, at 10:58 AM, Stefan Kueng <[hidden email]> wrote:
>
> So, with this patch I'm able to do a blame on utf16 files, so I'd like to commit this if there are no objections.
>

I have no objections, just a question. I thought SVN treated utf16 files as binary.  How does blame work at all?  I thought diff does not even work for utf16.  Does this require setting svn:mime-type to text/plain or something like that as well?

Mark
Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Stefan Kueng
In reply to this post by Julian Foad-5


On 05.01.2019 17:53, Julian Foad wrote:

> Stefan Kueng wrote:
>> When running blame on an utf16 text file, the lines can not be used
>> since the blame callback only passes a 'char *' parameter which means it
>> ends at the first zero char. But actually, svn knows if the line has
>> more content.
>
> +1 on doing something to fix the problem.
>
> Just briefly:
> * we already have a (char*, len) wrapper, called svn_string_t, so I would expect it would be neatest to use that;

Ok, I'll change that and send a new patch.

> * splitting the lines on a bare LF byte, and having a left-over zero byte, sounds just so wrong... it can't be right, can it? Can't we do something better? Use a callback for deciding how to split lines, or something? I haven't looked at the code, so not sure what exactly.

Well, without doing several checks for what kind of encoding the line
has this won't be possible. I'm just glad that the function as is does
not stop at the zero char but only on 'lf' - so even if the file is
really binary it "works".

Stefan
Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Stefan Kueng
In reply to this post by Mark Phippard-3


On 05.01.2019 18:05, Mark Phippard wrote:
>
>> On Jan 5, 2019, at 10:58 AM, Stefan Kueng <[hidden email]> wrote:
>>
>> So, with this patch I'm able to do a blame on utf16 files, so I'd like to commit this if there are no objections.
>>
>
> I have no objections, just a question. I thought SVN treated utf16 files as binary.  How does blame work at all?  I thought diff does not even work for utf16.  Does this require setting svn:mime-type to text/plain or something like that as well?

In default mode, yes.
But the --force flag makes it run on every kind of file.

Stefan
Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Stefan Kueng
In reply to this post by Julian Foad-5


On 05.01.2019 17:53, Julian Foad wrote:

> Stefan Kueng wrote:
>> When running blame on an utf16 text file, the lines can not be used
>> since the blame callback only passes a 'char *' parameter which means it
>> ends at the first zero char. But actually, svn knows if the line has
>> more content.
>
> +1 on doing something to fix the problem.
>
> Just briefly:
> * we already have a (char*, len) wrapper, called svn_string_t, so I would expect it would be neatest to use that;
here's a patch using svn_stringbuf_t for review.

Stefan

blame6_2.patch (9K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Daniel Shahaf-2
Stefan Kueng wrote on Sat, 05 Jan 2019 21:15 +0100:
> here's a patch using svn_stringbuf_t for review.

Change "Provided for backwards compatibility with the 1.6 API" to
"Provided for backwards compatibility with the 1.11 API" in
svn_client_blame5()'s docstring.
Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Stefan Kueng


On 05.01.2019 22:35, Daniel Shahaf wrote:
> Stefan Kueng wrote on Sat, 05 Jan 2019 21:15 +0100:
>> here's a patch using svn_stringbuf_t for review.
>
> Change "Provided for backwards compatibility with the 1.6 API" to
> "Provided for backwards compatibility with the 1.11 API" in
> svn_client_blame5()'s docstring.
>

done. See attached patch.

Stefan

blame6_3.patch (8K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Branko Čibej
On 06.01.2019 14:10, Stefan Kueng wrote:

>
>
> On 05.01.2019 22:35, Daniel Shahaf wrote:
>> Stefan Kueng wrote on Sat, 05 Jan 2019 21:15 +0100:
>>> here's a patch using svn_stringbuf_t for review.
>>
>> Change "Provided for backwards compatibility with the 1.6 API" to
>> "Provided for backwards compatibility with the 1.11 API" in
>> svn_client_blame5()'s docstring.
>>
>
> done. See attached patch.

Sorry about getting into this late, but as Julian said:

> * we already have a (char*, len) wrapper, called svn_string_t, so I would expect it would be neatest to use that;

but the patch has:

> @@ -758,6 +758,28 @@
> ·*·will·be·true·if·the·reason·there·is·no·blame·information·is·that·the·line
>
> ·*·was·modified·locally.·In·all·other·cases·@a·local_change·will·be·false.
>
> ·*
> +·*·@since·New·in·1.12.
> +·*/
> +typedef·svn_error_t·*(*svn_client_blame_receiver4_t)(
> +··void·*baton,
> +··svn_revnum_t·start_revnum,
> +··svn_revnum_t·end_revnum,
> +··apr_int64_t·line_no,
> +··svn_revnum_t·revision,
> +··apr_hash_t·*rev_props,
> +··svn_revnum_t·merged_revision,
> +··apr_hash_t·*merged_rev_props,
> +··const·char·*merged_path,
> +··svn_stringbuf_t·*line,
> +··svn_boolean_t·local_change,
> +··apr_pool_t·*pool);


The svn_stringbuf_t struct is not appropriate here; please use
svn_string_t. The former is used as a buffer to construct larger
strings, that's why it contains a reference to a pool and a blocksize.
the blame receiver gets data that are already allocated from a pool and
should be immutable to the receiver; the appropriate declaration here is

    const svn_string_t *line;


as you only need the data pointer and the length, and very much do _not_
need the pool and blocksize, nor do you want the data or length to be
modifiable by the callback.

That will make the changes in libsvn_client/blame.c a bit more complex
since you'll have to create a local svn_string_t object (on stack)
before calling the receiver, but it makes the receiver's interface
cleaner. Something like this would work:

@@ -941,18 +941,20 @@ svn_client_blame5(const char *target,
             SVN_ERR(ctx->cancel_func(ctx->cancel_baton));
           if (!eof || sb->len)
             {
+              const svn_string_t line = {sb->data, sb->len};
+
               if (walk->rev)
                 SVN_ERR(receiver(receiver_baton, start_revnum, end_revnum,
                                  line_no, walk->rev->revision,
                                  walk->rev->rev_props, merged_rev,
                                  merged_rev_props, merged_path,
-                                 sb->data, FALSE, iterpool));
+                                 &line, FALSE, iterpool));
               else
                 SVN_ERR(receiver(receiver_baton, start_revnum, end_revnum,
                                  line_no, SVN_INVALID_REVNUM,
                                  NULL, SVN_INVALID_REVNUM,
                                  NULL, NULL,
-                                 sb->data, TRUE, iterpool));
+                                 &line, TRUE, iterpool));
             }
           if (eof) break;
         }



Other than that, I think it's a bad idea to leave the trailing null byte
from the UTF-16-LE representation of U+000a at the beginning of the next
"line". If we're making this change in a *public* API, we should not
expect all users to hack around such a misfeature.

Until we have better support for other Unicode representations, we
should detect that null byte and include it as part of the reported
blame line.

-- Brane

Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Branko Čibej
On 06.01.2019 15:05, Branko Čibej wrote:
> Until we have better support for other Unicode representations, we
> should detect that null byte and include it as part of the reported
> blame line.

That might also imply detecting that the file encoding is not UTF-16-BE
instead ... which could very likely also have a null byte after the 0x0a
that does belong to the next line. I'd suggest checking for odd/even
offsets of the newline byte, since both UTF-16 representations must have
an even number of bytes in each line. It would be nice to avoid
treating, e.g., U+010A as a line break ...

-- Brane

Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Julian Foad-5
In reply to this post by Branko Čibej
Branko Čibej wrote:
> +              const svn_string_t line = {sb->data, sb->len};

In plain old C, we can't write that, but we have a helper function: svn_stringbuf__morph_into_string().
--
- Julian
Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Branko Čibej
On 06.01.2019 15:22, Julian Foad wrote:
> Branko Čibej wrote:
>> +              const svn_string_t line = {sb->data, sb->len};
> In plain old C, we can't write that, but we have a helper function: svn_stringbuf__morph_into_string().

I've noticed that clang is a bit deficient when it comes to pedantic C90
warnings. Sorry.

-- Brane

Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Stefan Kueng
In reply to this post by Branko Čibej


On 06.01.2019 15:05, Branko Čibej wrote:

> Sorry about getting into this late, but as Julian said:
>
>> * we already have a (char*, len) wrapper, called svn_string_t, so I would expect it would be neatest to use that;
>
> but the patch has:
>
>> @@ -758,6 +758,28 @@
>> ·*·will·be·true·if·the·reason·there·is·no·blame·information·is·that·the·line
>>
>> ·*·was·modified·locally.·In·all·other·cases·@a·local_change·will·be·false.
>>
>> ·*
>> +·*·@since·New·in·1.12.
>> +·*/
>> +typedef·svn_error_t·*(*svn_client_blame_receiver4_t)(
>> +··void·*baton,
>> +··svn_revnum_t·start_revnum,
>> +··svn_revnum_t·end_revnum,
>> +··apr_int64_t·line_no,
>> +··svn_revnum_t·revision,
>> +··apr_hash_t·*rev_props,
>> +··svn_revnum_t·merged_revision,
>> +··apr_hash_t·*merged_rev_props,
>> +··const·char·*merged_path,
>> +··svn_stringbuf_t·*line,
>> +··svn_boolean_t·local_change,
>> +··apr_pool_t·*pool);
>
>
> The svn_stringbuf_t struct is not appropriate here; please use
> svn_string_t. The former is used as a buffer to construct larger
> strings, that's why it contains a reference to a pool and a blocksize.
> the blame receiver gets data that are already allocated from a pool and
> should be immutable to the receiver; the appropriate declaration here is
>
>      const svn_string_t *line;
good point.

> as you only need the data pointer and the length, and very much do _not_
> need the pool and blocksize, nor do you want the data or length to be
> modifiable by the callback.
>
> That will make the changes in libsvn_client/blame.c a bit more complex
> since you'll have to create a local svn_string_t object (on stack)
> before calling the receiver, but it makes the receiver's interface
> cleaner. Something like this would work:

see attached patch.

> Other than that, I think it's a bad idea to leave the trailing null byte
> from the UTF-16-LE representation of U+000a at the beginning of the next
> "line". If we're making this change in a *public* API, we should not
> expect all users to hack around such a misfeature.

problem is: if we just skip any null char after an lf, then we would
screw up utf16be encodings (or le? I always get those two confused).

> Until we have better support for other Unicode representations, we
> should detect that null byte and include it as part of the reported
> blame line.

another situation which unfortunately is not that uncommon: files that
have different encodings in different lines or even inside the same
line. I've seen that especially in log files where the logger first
prints timestamp in ascii, but then the log-text is in utf8 or even
utf16. Such situations I think would require too much detection logic
for the blame function and should be left to clients - they might not be
able to even show such lines so why bother trying to do the detection
right for that.

Stefan

blame6_4.patch (10K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Stefan Kueng
In reply to this post by Julian Foad-5


On 06.01.2019 15:22, Julian Foad wrote:
> Branko Čibej wrote:
>> +              const svn_string_t line = {sb->data, sb->len};
>
> In plain old C, we can't write that, but we have a helper function: svn_stringbuf__morph_into_string().
>

but that destroys the original stringbuf, but that one is still needed.
So we can't use that helper function.

Stefan

Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Branko Čibej
In reply to this post by Stefan Kueng
On 06.01.2019 17:16, Stefan Kueng wrote:

>
>
> On 06.01.2019 15:05, Branko Čibej wrote:
>
>> Sorry about getting into this late, but as Julian said:
>>
>>> * we already have a (char*, len) wrapper, called svn_string_t, so I
>>> would expect it would be neatest to use that;
>>
>> but the patch has:
>>
>>> @@ -758,6 +758,28 @@
>>> ·*·will·be·true·if·the·reason·there·is·no·blame·information·is·that·the·line
>>>
>>>
>>> ·*·was·modified·locally.·In·all·other·cases·@a·local_change·will·be·false.
>>>
>>>
>>> ·*
>>> +·*·@since·New·in·1.12.
>>> +·*/
>>> +typedef·svn_error_t·*(*svn_client_blame_receiver4_t)(
>>> +··void·*baton,
>>> +··svn_revnum_t·start_revnum,
>>> +··svn_revnum_t·end_revnum,
>>> +··apr_int64_t·line_no,
>>> +··svn_revnum_t·revision,
>>> +··apr_hash_t·*rev_props,
>>> +··svn_revnum_t·merged_revision,
>>> +··apr_hash_t·*merged_rev_props,
>>> +··const·char·*merged_path,
>>> +··svn_stringbuf_t·*line,
>>> +··svn_boolean_t·local_change,
>>> +··apr_pool_t·*pool);
>>
>>
>> The svn_stringbuf_t struct is not appropriate here; please use
>> svn_string_t. The former is used as a buffer to construct larger
>> strings, that's why it contains a reference to a pool and a blocksize.
>> the blame receiver gets data that are already allocated from a pool and
>> should be immutable to the receiver; the appropriate declaration here is
>>
>>      const svn_string_t *line;
>
> good point.
>
>> as you only need the data pointer and the length, and very much do _not_
>> need the pool and blocksize, nor do you want the data or length to be
>> modifiable by the callback.
>>
>> That will make the changes in libsvn_client/blame.c a bit more complex
>> since you'll have to create a local svn_string_t object (on stack)
>> before calling the receiver, but it makes the receiver's interface
>> cleaner. Something like this would work:
>
> see attached patch.
>
>> Other than that, I think it's a bad idea to leave the trailing null byte
>> from the UTF-16-LE representation of U+000a at the beginning of the next
>> "line". If we're making this change in a *public* API, we should not
>> expect all users to hack around such a misfeature.
>
> problem is: if we just skip any null char after an lf, then we would
> screw up utf16be encodings (or le? I always get those two confused).


Windows default is UTF-16-LE, at least on x86(_64) and other
little-endian architectures. I'm not sure what they do on ARM but I'd be
surprised if Windows doesn't put it in little-endian mode, given that
decades of legacy software assume little-endian.

A simple check would be:

  * if 0x0a is on an odd offset, and the next byte is 0x00, then it's a
    UTF-16-LE linefeed;
  * else if 0x0a is on an even offset, and the _previous_ byte is 0x00,
    then it's a UTF-16-BE linefeed;
  * otherwise just hope it's a linefeed and move on.


But given your example below of mixing ASCII/UTF-8/UTF-16, it's better
to just leave well enough alone. :(

>> Until we have better support for other Unicode representations, we
>> should detect that null byte and include it as part of the reported
>> blame line.
>
> another situation which unfortunately is not that uncommon: files that
> have different encodings in different lines or even inside the same
> line. I've seen that especially in log files where the logger first
> prints timestamp in ascii, but then the log-text is in utf8

... which isn't a problem since ASCII and UTF-8 are compatible ...

> or even utf16.

... but this is a rather huge problem.

> Such situations I think would require too much detection logic for the
> blame function and should be left to clients - they might not be able
> to even show such lines so why bother trying to do the detection right
> for that.


You're right about that. I wouldn't dream of supporting such things
within the blame callback itself. However it would still be nice to at
least document what's happening.

The patch itself looks OK to me now.

-- Brane

Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Daniel Shahaf-2
Branko Čibej wrote on Sun, 06 Jan 2019 19:37 +0100:
> A simple check would be:
>
>   * if 0x0a is on an odd offset, and the next byte is 0x00, then it's a
>     UTF-16-LE linefeed;
>   * else if 0x0a is on an even offset, and the _previous_ byte is 0x00,
>     then it's a UTF-16-BE linefeed;

Would would happen if it were an ASCII/UTF-8 file that happened to
have a literal NUL byte next to an LF byte?  I have seen/used
some of those.

>   * otherwise just hope it's a linefeed and move on.

The encoding may also be set explicitly via a svn:mime-type="text/foo;
charset=utf-16-le" property.  (We even parse that in mod_dav_svn, I think?)

Cheers,

Daniel
Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Stefan Kueng
In reply to this post by Branko Čibej


On 06.01.2019 19:37, Branko Čibej wrote:

> Windows default is UTF-16-LE, at least on x86(_64) and other
> little-endian architectures. I'm not sure what they do on ARM but I'd be
> surprised if Windows doesn't put it in little-endian mode, given that
> decades of legacy software assume little-endian.
>
> A simple check would be:
>
>    * if 0x0a is on an odd offset, and the next byte is 0x00, then it's a
>      UTF-16-LE linefeed;
>    * else if 0x0a is on an even offset, and the _previous_ byte is 0x00,
>      then it's a UTF-16-BE linefeed;
>    * otherwise just hope it's a linefeed and move on.

looks good for proper files. But I have two files right here that are
utf8 encoded, but have also null bytes in it: the stupid app that writes
those uses the null byte to separate lines, and lf to separate
paragraphs. So for these files the check would fail.

>
> You're right about that. I wouldn't dream of supporting such things
> within the blame callback itself. However it would still be nice to at
> least document what's happening.

I'll add some more info to the doc string and resend the patch tomorrow.

The advantage if this is done in an UI client: if the detection of the
encoding is wrong, I can just add a button/combobox/whatever so the user
can choose the right encoding right there when showing the blame output.
If the detection is done in the svn lib and is wrong, then an UI client
could not do that.

Stefan
Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Stefan Kueng
In reply to this post by Daniel Shahaf-2


On 06.01.2019 19:54, Daniel Shahaf wrote:
> The encoding may also be set explicitly via a svn:mime-type="text/foo;
> charset=utf-16-le" property.  (We even parse that in mod_dav_svn, I think?)

Setting the mime-type to text/.. with the utf16 charset helps only that
the --force flag doesn't need to be passed to commands, but that all
(client-side) as far as I know.

Here's another patch with some more doc strings. I guess that can be
improved, but I'm really not good at writing docs...

Stefan


blame6_5.patch (10K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Daniel Shahaf-2
Stefan Kueng wrote on Sun, Jan 06, 2019 at 20:40:28 +0100:

> +++ subversion/include/svn_client.h (working copy)
> @@ -736,10 +736,11 @@
>   * @{
>   */
>  
> -/** Callback type used by svn_client_blame5() to notify the caller
> +/** Callback type used by svn_client_blame6() to notify the caller
>   * that line @a line_no of the blamed file was last changed in @a revision
>   * which has the revision properties @a rev_props, and that the contents were
> - * @a line.
> + * @a line. The @a line content is delivered as is, it is up to the client to
> + * determine the encoding.

s/,/;/

Shouldn't the docstring explicitly say whether or not the value of @a
line includes the terminating newline?  (Preëxisting issue)

> @@ -758,6 +759,33 @@
>   * will be true if the reason there is no blame information is that the line
>   * was modified locally. In all other cases @a local_change will be false.
>   *
> + * @note if the text encoding of the file is not ASCII or utf8, the end-of-line
> + * detection may lead to lines having a one byte offset. It is up to the client

"One byte offset" is not true in general; it is true for UTF-16 but
there are other encodings in the world.  Besides, I would sooner point
out that if the file isn't in UTF-8 (including ASCII), the end of line
detection may be *wrong* since it looks for the byte 0x0A, which may not
even be part of a (possibly multibyte) newline character.

It's fine to give specific details about UTF-16, but we should give the
more generally-applicable information first.

> + * to handle these situations. Blaming non ASCII/utf8 files requires the

s/utf8/UTF-8/ (two instances)

> + * @a force flag to be set when calling the blame function.

s/the blame function/svn_client_blame6/

> + * @since New in 1.12.
> + */
> +typedef svn_error_t *(*svn_client_blame_receiver4_t)(
> +  void *baton,
> +  svn_revnum_t start_revnum,
> +  svn_revnum_t end_revnum,
> +  apr_int64_t line_no,
> +  svn_revnum_t revision,
> +  apr_hash_t *rev_props,
> +  svn_revnum_t merged_revision,
> +  apr_hash_t *merged_rev_props,
> +  const char *merged_path,
> +  const svn_string_t *line,
> +  svn_boolean_t local_change,
> +  apr_pool_t *pool);

> +++ subversion/libsvn_client/blame.c (working copy)
> @@ -676,6 +676,7 @@ svn_client_blame6(const char *target,
>    svn_stream_t *last_stream;
>    svn_stream_t *stream;
>    const char *target_abspath_or_url;
> +  svn_string_t line;
>  
>    if (start->kind == svn_opt_revision_unspecified
>        || end->kind == svn_opt_revision_unspecified)
> @@ -941,18 +942,20 @@
>              SVN_ERR(ctx->cancel_func(ctx->cancel_baton));
>            if (!eof || sb->len)
>              {
> +              line.data = sb->data;
> +              line.len = sb->len;

Reduce the scope of LINE to just this block?

>                if (walk->rev)
>                  SVN_ERR(receiver(receiver_baton, start_revnum, end_revnum,
>                                   line_no, walk->rev->revision,
>                                   walk->rev->rev_props, merged_rev,
>                                   merged_rev_props, merged_path,
> -                                 sb->data, FALSE, iterpool));
> +                                 &line, FALSE, iterpool));
>                else
>                  SVN_ERR(receiver(receiver_baton, start_revnum, end_revnum,
>                                   line_no, SVN_INVALID_REVNUM,
>                                   NULL, SVN_INVALID_REVNUM,
>                                   NULL, NULL,
> -                                 sb->data, TRUE, iterpool));
> +                                 &line, TRUE, iterpool));
>              }
>            if (eof) break;
>          }

Cheers,

Daniel
Reply | Threaded
Open this post in threaded view
|

Re: extending the blame callback

Branko Čibej
In reply to this post by Daniel Shahaf-2
On 06.01.2019 19:54, Daniel Shahaf wrote:

> Branko Čibej wrote on Sun, 06 Jan 2019 19:37 +0100:
>> A simple check would be:
>>
>>   * if 0x0a is on an odd offset, and the next byte is 0x00, then it's a
>>     UTF-16-LE linefeed;
>>   * else if 0x0a is on an even offset, and the _previous_ byte is 0x00,
>>     then it's a UTF-16-BE linefeed;
> Would would happen if it were an ASCII/UTF-8 file that happened to
> have a literal NUL byte next to an LF byte?  I have seen/used
> some of those.

Yes, well, in that case the NUL just might randomly "move" from one line
to the next, depending on changes in the file. Nothing we can do
short-term will fix that ... and as far as I'm concerned that's not a
valid text file, so it won't disturb me too much if blame results are a
bit fuzzy in such cases.



>>   * otherwise just hope it's a linefeed and move on.
> The encoding may also be set explicitly via a svn:mime-type="text/foo;
> charset=utf-16-le" property.


Yes, but supporting _that_ in our code is a much, much bigger task than
what we're talking about in this thread. What I proposed here would be a
simple, localised hack to avoid splitting lines in the middle of a code
sequence. If we do decide to implement something like that for line
breaks, we should extend it to both variants of UTF-32 as well.

(Also, note the abuse of the 'charset' attribute in Content-Type ... the
"character set" is Unicode, it should be Content-Encoding: UTFsomething,
but that's used for something different again :\ )


>   (We even parse that in mod_dav_svn, I think?)

Even if we do, it's irrelevant to this discussion. We set the value of
the Content-Type header based on svn:mime-type for a simple GET request,
but don't interpret it otherwise on the server side.

-- Brane
123