(swig-py3) how to handle "invalid" property value?

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

(swig-py3) how to handle "invalid" property value?

Yasuhito FUTATSUKI
Hi,
I hear that property values may not be vaild UTF-8 string if they are
set with skipping validation.
(https://twitter.com/jun66j5/status/1067295499907084288 (written in
Japanese language); https://trac.edgewall.org/ticket/4321)

However, current swig-py3 typemaps to return those values use
PyStr_FromStringAndSize() to convert svn_string_t into Python's str object,
which raise UnicodeDecodeError for invalid UTF-8 octet sequence, so there
is no way to get those strict value.

To resolve this issue, there seems to be some some options:

(1). Those API always return str (Unicode) with 'strict' conversion.
     if error occured, abandon to get these values. (current implementation)
(2). Those API always return str with 'surrogateescape' conversion.
     if applicatin want to detect irregular data, search \u+dc00-\u+dcff
     character in values.
(3). Those API always return bytes. if applications want to handle as
     str, decode them in application side.
(4). Those API return str for valid data, and return bytes for invalid data
     to avoid missing way to get data.
(5). other (I have no idea..., though)

I think (2) or (3) is appropriate, but I don't have confidence.
Any ideas?
--
Yasuhito FUTATSUKI
Reply | Threaded
Open this post in threaded view
|

Re: (swig-py3) how to handle "invalid" property value?

Daniel Shahaf-2
Yasuhito FUTATSUKI wrote on Tue, Dec 11, 2018 at 01:18:19 +0900:
> Hi,
> I hear that property values may not be vaild UTF-8 string if they are
> set with skipping validation.

Let's be more precise.

In general, a property's name must satisfy the conditions documented in
svn_prop_name_is_valid()'s docstring and a property's value is an opaque
binary blob, just like file contents (when svn:eol-style and svn:keywords
are unset).

However, some specific properties have additional requirements on their
values.  If a property passes svn_prop_is_boolean() then its value MUST
be "*" and if a property is an svn:* property then its value MUST be
UTF-8 with LF line endings — the latter is enforced by svn_repos__validate_prop().

> (https://twitter.com/jun66j5/status/1067295499907084288 (written in
> Japanese language); https://trac.edgewall.org/ticket/4321)
>
> However, current swig-py3 typemaps to return those values use
> PyStr_FromStringAndSize() to convert svn_string_t into Python's str object,
> which raise UnicodeDecodeError for invalid UTF-8 octet sequence, so there
> is no way to get those strict value.
>
> To resolve this issue, there seems to be some some options:
>
> (1). Those API always return str (Unicode) with 'strict' conversion.
>     if error occured, abandon to get these values. (current implementation)
> (2). Those API always return str with 'surrogateescape' conversion.
>     if applicatin want to detect irregular data, search \u+dc00-\u+dcff
>     character in values.
> (3). Those API always return bytes. if applications want to handle as
>     str, decode them in application side.
> (4). Those API return str for valid data, and return bytes for invalid data
>     to avoid missing way to get data.
> (5). other (I have no idea..., though)
>
> I think (2) or (3) is appropriate, but I don't have confidence.
> Any ideas?

Generic APIs that work on any property should return bytes.  More
specific APIs that work on properties that have further restrictions
(svn:needs-lock, svn:date), or even a structured value (svn:externals,
svn:mergeinfo), may use appropriately more specific data types.

In general, if an API has an input of type bytes and needs to return
str, I would have it throw an exception if the conversion fails, so the
caller is forced to deal with the failure mode explicitly.  It'd be fine
to use surrogateescape or return bytes *if the caller has explicitly
requested that*, but the default postcondition should be as simple as
possible: "This function either returns str or raises an exception".

Makes sense?

Cheers,

Daniel
Reply | Threaded
Open this post in threaded view
|

Re: (swig-py3) how to handle "invalid" property value?

Yasuhito FUTATSUKI
Thank you to make my question clear.

Actually, I misunderstood property values and its validation.

On 12/11/18 5:12 PM, Daniel Shahaf wrote:
> Generic APIs that work on any property should return bytes.  More
> specific APIs that work on properties that have further restrictions
> (svn:needs-lock, svn:date), or even a structured value (svn:externals,
> svn:mergeinfo), may use appropriately more specific data types.

I see. Then I'll try to fix those API wrappers returns propety values
to return as bytes in py3. I think precise restriction check for
properties is not the matter of wrappers, so I don't touch it.
 
> In general, if an API has an input of type bytes and needs to return
> str, I would have it throw an exception if the conversion fails, so the
> caller is forced to deal with the failure mode explicitly.  It'd be fine
> to use surrogateescape or return bytes *if the caller has explicitly
> requested that*, but the default postcondition should be as simple as
> possible: "This function either returns str or raises an exception".

I'm sorry, I meant "Those APIs" as not generic APIs, but "those APIs
return propety values", and I don't want to hide exception if there
are ways not to do so, too. (I've searched the way to add keywords
arguments with default values like charset, errors, or return_type
with make relation to argment for props, but I couldn't)
However, I also think if C API returns value successfuly and the wrapper
fails to return its value "unexpectedly", the wrapper is incomplete.

My interest is "what should care as bytes object and what should
care as str object Python 3 API wrapper". Current typemaps seems
almost all char *, svn_string_t * are mapped to str object in py3.

Thanks,
--
Yasuhito FUTATSUKI
Reply | Threaded
Open this post in threaded view
|

Re: (swig-py3) how to handle "invalid" property value?

Daniel Shahaf-2
Yasuhito FUTATSUKI wrote on Tue, 11 Dec 2018 19:22 +0900:

> On 12/11/18 5:12 PM, Daniel Shahaf wrote:
> > Generic APIs that work on any property should return bytes.  More
> > specific APIs that work on properties that have further restrictions
> > (svn:needs-lock, svn:date), or even a structured value (svn:externals,
> > svn:mergeinfo), may use appropriately more specific data types.
>
> I see. Then I'll try to fix those API wrappers returns propety values
> to return as bytes in py3. I think precise restriction check for
> properties is not the matter of wrappers, so I don't touch it.
>  

Agreed: in general, the wrappers should follow the behaviour of the C
API's as closely as possible.  If a random C API doesn't validate
properties as it returns them, neither should its Python wrapper.

> > In general, if an API has an input of type bytes and needs to return
> > str, I would have it throw an exception if the conversion fails, so the
> > caller is forced to deal with the failure mode explicitly.  It'd be fine
> > to use surrogateescape or return bytes *if the caller has explicitly
> > requested that*, but the default postcondition should be as simple as
> > possible: "This function either returns str or raises an exception".
>
> I'm sorry, I meant "Those APIs" as not generic APIs, but "those APIs
> return propety values", and I don't want to hide exception if there
> are ways not to do so, too. (I've searched the way to add keywords
> arguments with default values like charset, errors, or return_type
> with make relation to argment for props, but I couldn't)

Interesting idea.  (I don't know swig well enough to comment on it.)

> However, I also think if C API returns value successfuly and the wrapper
> fails to return its value "unexpectedly", the wrapper is incomplete.

Sure.

> My interest is "what should care as bytes object and what should
> care as str object Python 3 API wrapper". Current typemaps seems
> almost all char *, svn_string_t * are mapped to str object in py3.

Returning str would have been correct under py2.  It might be just that
the typemaps haven't been updated for the str/bytes changes in py3.
The log of the swig-py3 branch might have some clues.

There _are_ some cases of char* / svn_string_t* that semantically
correspond to py3 str, for example, svn_cmdline_printf() --- although
that's not a good example, as it's variadic --- but even so, if it's be
easier to implement the rule "_all_ char* and svn_string_t* in C
signatures are expected to be 'bytes' in Python", I'll support that.
It'll be easier for API consumers to learn the rule "Always .encode()
your str's" than to have to learn which functions are exceptions to the rule.

Cheers,

Daniel