File and folder names corrupted when importing from CVS using cvs2svn

Bo Berglund
A few weeks ago I migrated our CVS repositories to SVN using the
latest version of cvs2svn. The CVS server was on a Windows 2003 server
and the new VisualSVN server on a Windows 2016 Server. I did the
cvs2svn conversion on an Ubuntu server to which I had copied the CVS
repositories.
Then I imported the dump files on Windows 2016 via the VisualSVN GUI
commands.

It all looked like it was a success except for the handling of file
properties. Files marked binary in CVS are not marked as such in SVN
so I worry that they may become corrupted in the future. But it seems
like it does work OK.

Today I discovered a different issue that is more serious in nature:
In some projects there have been non-US characters used for file and
folder names like "Bygglovsansökan", "HÅL I TAKKASSETT" etc.

When I check out these projects from SVN the Swedish characters in the
names are now replaced by a series of high characters (hex view):

Å = C3 90 C2 9F
Ä = C3 90 C2 9E
Ö = -- did not find this character in svn yet --
å = C3 90 C2 96
ä = C3 90 C2 94
ö = C3 90 C2 A4

I don't know from where this problem originates, either it is a flaw
in the cvs2svn script, the configuration of the conversion or in the
format of the generated dump files.
Otherwise it may be a problem when importing the dump files into the
VisualSVN server....

What could I do to fix this?
(And please note that the new repository is in use so there are a
number of commits done since the migration...)


--
Bo Berglund
Developer in Sweden


Re: File and folder names corrupted when importing from CVS using cvs2svn

Bo Berglund
On Thu, 18 Jan 2018 17:38:04 +0100, Bo Berglund
<[hidden email]> wrote:

>I don't know from where this problem originates, either it is a flaw
>in the cvs2svn script, the configuration of the conversion or in the
>format of the generated dump files.
>Otherwise it may be a problem when importing the dump files into the
>VisualSVN server....

I made a test by creating a new file in the working copy named as
follows:
Testing_Å_Ä_Ö_å_ä_ö.txt

Then I added it and committed.
Then I used the VisualSVN repository web browser and found the file
with the correct name. So it seems like the conversion from CVS to Svn
is where the screw-up is located...

Still, what do I do now?


--
Bo Berglund
Developer in Sweden


Re: File and folder names corrupted when importing from CVS using cvs2svn

Branko Čibej
On 18.01.2018 17:51, Bo Berglund wrote:

> On Thu, 18 Jan 2018 17:38:04 +0100, Bo Berglund
> <[hidden email]> wrote:
>
>> I don't know from where this problem originates, either it is a flaw
>> in the cvs2svn script, the configuration of the conversion or in the
>> format of the generated dump files.
>> Otherwise it may be a problem when importing the dump files into the
>> VisualSVN server....
> I made a test by creating a new file in the working copy named as
> follows:
> Testing_Å_Ä_Ö_å_ä_ö.txt
>
> Then I added it and committed.
> Then I used the VisualSVN repository web browser and found the file
> with the correct name. So it seems like the conversion from CVS to Svn
> is where the screw-up is located...

AFAIR you did not convert your CVS repositories on the same machine that
you used as the CVS server, correct? So ... you may not have used the
same character encoding during conversion as during normal operations.
As a guess I'd say that your (Windows, CVSNT) server uses the Windows
Latin 1 ("Western") encoding, and your (Linux) machine where you did the
conversion uses UTF-8.

If that's the case, it's not surprising that accented characters were
converted improperly.

(FWIW, the hex codes you show are valid UTF-8 but the characters they
encode have no relation to the originals.)

> Still, what do I do now?

Two options:

  * If you don't care about history, just rename all the offending files
    in the repository to their proper names.
  * If you *do* care about history, repeat the conversion, using the
    correct locale settings, then use svnsync to bring the correctly
    converted repositories up to date.
      o Alternatively, edit the original dump files and fix the file
        names there (they have to be encoded in UTF-8) to avoid having
        to repeat the conversion from CVS.

The second option is going to be extremely tricky.
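
The dump-file alternative can be sketched roughly as follows, in Python. This is only an illustration under stated assumptions, not a tested migration tool: it assumes the usual dump layout (a block of "Key: value" header lines, a blank line, then exactly Content-length bytes of body), it rewrites only the Node-path and Node-copyfrom-path headers, and it uses the five mangled byte sequences listed in the first post as its replacement table. The names MANGLED, fix_path and fix_dump are made up for the example. Paths stored inside properties such as svn:mergeinfo are not handled.

```python
# Mangled UTF-8 byte sequences reported in the first post, mapped to the
# intended characters. "Ö" is missing because its mangled form was not found.
MANGLED = {
    bytes.fromhex("C3 90 C2 9F"): "Å",
    bytes.fromhex("C3 90 C2 9E"): "Ä",
    bytes.fromhex("C3 90 C2 96"): "å",
    bytes.fromhex("C3 90 C2 94"): "ä",
    bytes.fromhex("C3 90 C2 A4"): "ö",
}
PATH_HEADERS = (b"Node-path: ", b"Node-copyfrom-path: ")


def fix_path(raw: bytes) -> bytes:
    """Replace each known mangled sequence with the proper UTF-8 encoding."""
    for bad, good in MANGLED.items():
        raw = raw.replace(bad, good.encode("utf-8"))
    return raw


def fix_dump(src, dst) -> None:
    """Copy a dump stream from src to dst (binary files), fixing path headers."""
    while True:
        line = src.readline()
        if not line:
            break  # end of stream
        if line == b"\n":
            dst.write(line)  # blank separator between records
            continue
        # Read one header block, up to and including its closing blank line.
        length = 0
        while line not in (b"\n", b""):
            if line.startswith(PATH_HEADERS):
                key, _, value = line.partition(b": ")
                line = key + b": " + fix_path(value)
            elif line.lower().startswith(b"content-length: "):
                length = int(line.split(b": ", 1)[1])
            dst.write(line)
            line = src.readline()
        dst.write(line)
        # Properties and file contents are copied byte for byte, so a file
        # whose text happens to contain "Node-path: " is left alone.
        dst.write(src.read(length))
```

It would be called with two files opened in binary mode, for example fix_dump(open("old.dump", "rb"), open("fixed.dump", "wb")), and the result should be loaded into a scratch repository and compared before anything is replaced.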


-- Brane


Re: File and folder names corrupted when importing from CVS using cvs2svn

Andreas Krey
In reply to this post by Bo Berglund
On Thu, 18 Jan 2018 17:38:04 +0000, Bo Berglund wrote:
...
> When I check out these projects from SVN the Swedish characters in the
> names are now replaced by a series of high characters (hex view):
>
> Å = C3 90 C2 9F

This is strange - it superficially looks like a double ISO-8859-1 to
utf8 conversion, but it isn't. Å is C5 in 8859-1 (and in Windows Latin
1), and that is represented as c3 85 in utf8, and doing the conversion
twice yields c3 83 c2 85 which looks similar to yours, but isn't the same.

Doing that in reverse C3 90 C2 9F goes back to D0 9F which is the code
point 41F (CYRILLIC CAPITAL LETTER PE). Strange.
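
The round trip described above can be reproduced with a few lines of Python. The sketch below takes the hex values from the first post, strips one UTF-8 layer, and decodes what is left. The last column is a hypothesis that does not come from the thread: it assumes the names were stored as DOS code page 850 bytes and were read as code page 866 (Cyrillic) during conversion. That assumption happens to reproduce all five reported characters, but it says nothing certain about where in the tool chain it happened.

```python
# Hex views of the mangled characters, as reported in the first post.
OBSERVED = {
    "Å": "C3 90 C2 9F",
    "Ä": "C3 90 C2 9E",
    "å": "C3 90 C2 96",
    "ä": "C3 90 C2 94",
    "ö": "C3 90 C2 A4",
}


def double_utf8(text: str) -> bytes:
    """Encode to UTF-8, misread the result as Latin-1, encode to UTF-8 again."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")


def undo_one_utf8_layer(data: bytes) -> bytes:
    """Decode UTF-8 and turn the resulting code points back into single bytes."""
    return data.decode("utf-8").encode("latin-1")


def report():
    """Return (intended, inner bytes as hex, decoded character, cp850 guess)."""
    rows = []
    for intended, hex_view in OBSERVED.items():
        inner = undo_one_utf8_layer(bytes.fromhex(hex_view))
        actual = inner.decode("utf-8")
        # ASSUMPTION, not from the thread: the original byte was cp850 and
        # was interpreted as cp866 at some point during the conversion.
        guess = actual.encode("cp866").decode("cp850")
        rows.append((intended, inner.hex(), actual, guess))
    return rows


if __name__ == "__main__":
    print(double_utf8("Å").hex())  # c383c285, the plain double conversion
    for row in report():
        print(row)
```

For Å the inner bytes come out as d09f, which decodes to U+041F, matching the analysis above.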

> What could I do to fix this?
> (And please note that the new repository is in use so there are a
> number of commits done since the migration...)

Standard SVN answer 'you should have...', in this case '...tested this
aspect before'.

Now I guess your best bet is to rename these files to the proper
thing (or remove them, as they are apparently not needed. :-) Old
history will look broken (but as nobody immediately had errors
with those trees perhaps that doesn't matter either).
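
For the rename route, a small Python helper could at least list what needs renaming in a checked-out working copy. This is a sketch under assumptions: the replacement table is built from the five hex values in the first post (Ö is missing because its mangled form was not reported), the function names are invented for the example, and it only prints a plan. The actual renames would still be done with svn mv and committed by hand after checking the list.

```python
import os

# Mangled sequences from the first post, decoded to the text a client sees.
MANGLED = {
    bytes.fromhex(hex_view).decode("utf-8"): good
    for hex_view, good in [
        ("C3 90 C2 9F", "Å"),
        ("C3 90 C2 9E", "Ä"),
        ("C3 90 C2 96", "å"),
        ("C3 90 C2 94", "ä"),
        ("C3 90 C2 A4", "ö"),
    ]
}


def fix_name(name: str) -> str:
    """Return name with every known mangled sequence replaced."""
    for bad, good in MANGLED.items():
        name = name.replace(bad, good)
    return name


def plan_renames(root: str):
    """List (old_path, new_path) pairs, children before their parent folder."""
    plan = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        if ".svn" in dirpath.split(os.sep):
            continue  # skip Subversion's administrative area
        for name in filenames + dirnames:
            fixed = fix_name(name)
            if name != ".svn" and fixed != name:
                plan.append((os.path.join(dirpath, name),
                             os.path.join(dirpath, fixed)))
    return plan


if __name__ == "__main__":
    for old, new in plan_renames("."):
        print(f'svn mv "{old}" "{new}"')
```

Walking bottom-up means a file inside a mangled folder is renamed while the folder still has its old name, so each printed command is valid at the moment it would be run.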

svndumping, filtering and reloading may fix the file names for
all revisions, but I have no idea how the client sandboxes
will react to that.

- Andreas

--
"Totally trivial. Famous last words."
From: Linus Torvalds <torvalds@*.org>
Date: Fri, 22 Jan 2010 07:29:21 -0800

Re: File and folder names corrupted when importing from CVS using cvs2svn

Daniel Shahaf-2
Andreas Krey wrote on Thu, 18 Jan 2018 19:14 +0100:
> svndumping, filtering and reloading may fix the file names for
> all revisions, but I have no idea how the client sandboxes
> will react to that.

The term is "working copies", and when changing the history one should
run 'svnadmin setuuid' on the repository, which will cause all network
operations from old working copies to error out up front with a UUID
mismatch error message (as desired).

Re: File and folder names corrupted when importing from CVS using cvs2svn

Nico Kadel-Garcia-2
On Thu, Jan 18, 2018 at 6:55 PM, Daniel Shahaf <[hidden email]> wrote:
> Andreas Krey wrote on Thu, 18 Jan 2018 19:14 +0100:
>> svndumping, filtering and reloading may fix the file names for
>> all revisions, but I have no idea how the client sandboxes
>> will react to that.
>
> The term is "working copies", and when changing the history one should
> run 'svnadmin setuuid' on the repository, which will cause all network
> operations from old working copies to error out up front with a UUID
> mismatch error message (as desired).

This is also when you rename the upstream Subversion hostname or reset
the URL for it altogether, so that old and broken working copies are
reported as completely absent. It helps prevent people from doing an
"svn switch" and expecting the histories to be consistent, and lets
them know that the old repository is unusable: a new working copy
should be checked out.

Re: File and folder names corrupted when importing from CVS using cvs2svn

Branko Čibej
On 24.01.2018 14:11, Nico Kadel-Garcia wrote:

> On Thu, Jan 18, 2018 at 6:55 PM, Daniel Shahaf <[hidden email]> wrote:
>> Andreas Krey wrote on Thu, 18 Jan 2018 19:14 +0100:
>>> svndumping, filtering and reloading may fix the file names for
>>> all revisions, but I have no idea how the client sandboxes
>>> will react to that.
>> The term is "working copies", and when changing the history one should
>> run 'svnadmin setuuid' on the repository, which will cause all network
>> operations from old working copies to error out up front with a UUID
>> mismatch error message (as desired).
> This is also when you rename the upstream Subversion hostname or reset
> the URL for it altogether, so that old and broken working copies are
> reported as completely absent. It helps prevent people from doing an
> "svn switch" and expecting the histories to be consistent, and lets
> them know that the old repository is unusable: a new working copy
> should be checked out.


Actually one cannot "svn switch" to a repository with a different UUID,
so changing that should be enough.

-- Brane