Missing LOCALE in post-commit hook leads to weird behaviour of `svnlook log` with unicode characters – broken transliterations

H.-Dirk Schmitt-2
I found a very weird behaviour of `svnlook log` that IMHO is a bug (or
at least a serious gap in the documentation).

Introduction
------------

Consider a log message like: 'Unicode Test → ø ÄÖÜ'

`svnlook log` invoked in a normal terminal session shows the proper
content.
This works because the locale in that environment is set to 'en_US.UTF-8'.

Now start to play: `env LC_ALL=C.UTF-8 svnlook log` also shows a
correct result.

Problem
-------
But when falling back to `env LC_ALL=C svnlook log`, I get a very flawed
result:

Unicode Test {U+2192} {U+00F8} AOU

→ and ø are replaced with their code point notation.
The German umlauts are transliterated in a very uncommon way:
in the old ASCII/typewriter days, Ä was transliterated as Ae (Ö → Oe,
…).

Why this behaviour is not a cosmetic problem
--------------------------------------------

Consider a post-commit hook fetching the commit message with `svnlook
log`.
The purpose is to postprocess the log message content, e.g. to append it
to Bugzilla issues.

The actual setup is svn+apache2 with a bash script as the post-commit hook.
The machine locale as reported by `localectl`: System Locale:
LANG=en_US.utf8

The content of every commit message transferred this way is broken as
described above.

This happens because the post-commit hook is running with a very
reduced set of environment variables:
   PWD=/
   SHLVL=1

In particular, neither `LC_ALL` nor `LANG` is set, so the hook runs in
the default "C" locale, which is equivalent to `LC_ALL=C`.
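
The reduced environment can be approximated outside of Apache. The following is only a sketch: it assumes that starting bash via `env -i` is close enough to how the hook is launched, which may not hold for every server setup. The function names `simulate_hook_env` and `check_locale_vars` are made up for this illustration.

```shell
#!/bin/bash
# Print the environment that a bash process sees when it is started
# without any inherited variables (an approximation of the hook's
# environment, not the real thing).
simulate_hook_env() {
    env -i bash -c 'env'
}

# Report whether any locale-related variable (LANG or LC_*) shows up
# in that simulated environment.
check_locale_vars() {
    if simulate_hook_env | grep -Eq '^(LANG|LC_[A-Z_]+)='; then
        echo "locale variables present"
    else
        echo "no locale variables set"
    fi
}

simulate_hook_env
check_locale_vars
```

For the real hook, a line such as `env > /tmp/hook-env.log` at the top of the script (the path is just an example) shows what is actually set.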

Suggested Mitigation/Fixing
---------------------------
1. Subversion should ensure that the system locale is forwarded to the
post-commit hook.
2. `svnlook` should support the `--encoding` switch.
3. German umlauts (and surely some other national characters in the
8-bit range) shouldn't be transliterated in a different
   way than other Unicode characters (see ø / {U+00F8}).
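
Until something along these lines exists, the hook script can set the locale itself before calling `svnlook`. The following is a minimal sketch, not a tested production hook. It assumes that the `C.UTF-8` locale is available on the server (`locale -a` lists the installed ones; `en_US.UTF-8` would be an alternative). `fetch_log_message` and `process_message` are hypothetical names; the latter is only a placeholder for the real postprocessing.

```shell
#!/bin/bash
# Sketch of a post-commit hook that sets its own locale.
# Subversion calls the hook with the repository path and the revision.
REPOS="${1:-}"
REV="${2:-}"

# The hook starts with an almost empty environment, so choose a UTF-8
# locale explicitly (assumption: C.UTF-8 is installed on this machine).
export LC_ALL=C.UTF-8

# Fetch the log message of revision $2 in repository $1.
fetch_log_message() {
    svnlook log -r "$2" "$1"
}

# Placeholder for the real postprocessing, e.g. forwarding the text
# to an issue tracker. It does nothing in this sketch.
process_message() {
    :
}

# Only run when called with hook arguments.
if [ -n "$REPOS" ] && [ -n "$REV" ]; then
    message=$(fetch_log_message "$REPOS" "$REV")
    process_message "$message"
fi
```

Setting `LC_ALL` mirrors the `env LC_ALL=C.UTF-8 svnlook log` call that produced correct output in the terminal session.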


PS: Searching with Google et al. did not turn up any real documentation
of this issue.