Proposal - option to store unzipped office documents on server side.


Paul Hammant-3
I understand the process to propose feature requests is to first post the idea on this forum, then if everyone agrees, post an item to Jira.

I previously wrote a blog entry on Git storing MS Office documents unzipped (.docx and .xlsx files are zips of XML and a few other things) - https://paulhammant.com/2015/07/30/git-storing-unzipped-office-docs/

That blog entry links to two diffs: one for a regular unzipped office doc, and one that was unzipped but reformatted before commit. After the reformat, the contents of the Mary.docx_ folder can be rezipped, and at least the Mac's MsWord opens the result fine - I tested it.

It would be great if there were a setting somewhere like:

unzip_these_suffixes_on_serverside_but_reconstitute_zipped_form_for_client_interop: docx, xlsx, pptx

(or 'unzip_rezip_suffixes')

So Subversion gets to store the text forms (rather than the zip), with improved opportunities for storage savings via the recently discussed fsfs options.

Better still:

unzip_reformat_store_rezip_on_access_suffixes: docx, xlsx, pptx
xml_reformat_command: xmllint --format
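
A minimal Python sketch of the unzip / reformat / rezip round trip described above. It is only an illustration: the function names `explode` and `rezip` are hypothetical, `xml.dom.minidom` stands in for `xmllint --format`, and nothing here is a Subversion feature. It assumes a trusted archive and does not check whether a given office application accepts the reformatted XML.

```python
import zipfile
from pathlib import Path
from xml.dom import minidom


def explode(zip_path: Path, out_dir: Path) -> list:
    """Unzip an archive into out_dir, pretty-printing its XML members.

    Returns the member names in archive order so rezip() can keep that order.
    Assumes a trusted archive: member names are used as paths unchecked.
    """
    names = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            data = zf.read(info)
            if info.filename.endswith((".xml", ".rels")):
                # Stand-in for "xmllint --format": one element per line.
                data = minidom.parseString(data).toprettyxml(
                    indent=" ", encoding="UTF-8"
                )
            target = out_dir / info.filename
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(data)
            names.append(info.filename)
    return names


def rezip(src_dir: Path, names: list, zip_path: Path) -> None:
    """Rebuild a zip from the exploded folder, in the recorded member order."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            zf.write(src_dir / name, arcname=name)
```

Usage would look like `names = explode(Path("Mary.docx"), Path("Mary.docx_"))` followed by `rezip(Path("Mary.docx_"), names, Path("Mary-rezipped.docx"))`. The rezipped file has the same members but is not byte-identical to the original archive.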

Of course, as well as storage benefits, there are diff benefits. Today those would most likely be realised through one of the web portal installs that bundle Subversion, since that is where people do most of their diff viewing these days (incidentally, one aspect of the game changer that GitHub has been since it emerged from alpha on February 8, 2008).


Unrelated quick question: is there a record of the Subversion 1.0 celebration event at CollabNet in Brisbane in 2004/5? I think there were a couple of notable absentees, but that sort of thing shouldn't disappear from the internet.

- Paul


Re: Proposal - option to store unzipped office documents on server side.

Philip Martin
Paul Hammant <[hidden email]> writes:

> It would be great if there were a setting somewhere like:
>
> unzip_these_suffixes_on_serverside_but_reconstitute_zipped_form_for_client_interop:
> docx, xlsx, pptx
>
> (or 'unzip_rezip_suffixes')
>
>
> So Subversion gets to store the text forms (rather than the zip), with

You are not the first to ask for this, but it is significantly more
complex than just a backend setting.

Given unzipped data U there is no single, canonical, compressed form Z
that represents U.  Instead there are multiple forms Z1, Z2, ... that
all expand to U.  If the client sends Z1 there is no guarantee that the
server will be able to recreate Z1 from U, it might produce Z2, Z3, ...
I suppose a version control system could be designed to allow you to
commit Z1 and get back any of Z1, Z2, Z3, ... but Subversion makes, and
assumes, the exact opposite: that you get back exactly what you commit.
If you break that assumption changes will be necessary all through
Subversion including, but not limited to, delta transfer, working copy
storage, checksum verification, client post-commit processing,
client-server compatibility, server upgrades, etc.
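
The point about multiple compressed forms can be illustrated with a small Python sketch. It uses raw zlib streams rather than real zip archives, and the sample data is made up; it shows only that two valid compressed forms of the same data need not share bytes or a checksum.

```python
import hashlib
import zlib

# Stands in for the unzipped data U (made-up sample content).
u = b"<w:t>Hello, Mary</w:t>" * 200

z1 = zlib.compress(u, 1)  # one valid compressed form of U
z2 = zlib.compress(u, 9)  # another valid compressed form of U

# Both forms expand to exactly the same data ...
same_content = zlib.decompress(z1) == zlib.decompress(z2) == u
# ... but the compressed bytes, and therefore their checksums, differ.
same_bytes = z1 == z2

print("same content:", same_content)
print("same bytes:  ", same_bytes)
print("SHA-1 of Z1: ", hashlib.sha1(z1).hexdigest())
print("SHA-1 of Z2: ", hashlib.sha1(z2).hexdigest())
```

A server that stored only U and recompressed on request would have to guess which of these forms the client originally sent.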

--
Philip

Re: Proposal - option to store unzipped office documents on server side.

Paul Hammant-3

> You are not the first to ask for this, but it is significantly more
> complex than just a backend setting. [....]

Yup, I didn't think about the SHA1 being different.  I'll implement it client-side, just ignore this request.

-ph

RE: Proposal - option to store unzipped office documents on server side.

Markus Schaber
Hi, Paul,

From: Paul Hammant [mailto:[hidden email]]
> > You are not the first to ask for this, but it is significantly more
> > complex than just a backend setting. [....]

> Yup, I didn't think about the SHA1 being different.  I'll implement it client-side, just ignore this request.

Note that you will already profit from differential storage even when using compressed zips, as included images etc. should have an identical compressed representation if the user is just doing text edits.

We got similar benefits at my former employer when we stored JAR files in SVN (archival of old revisions); especially small patches and hotfixes with only a few changed class files were transferred and stored efficiently.

You could also transparently transform the zip into an uncompressed zip file before uploading it, thus getting more benefits from differential storage - at the possible loss of bit-for-bit identity if you try to recompress the zip again on download, as different implementations of the algorithms tend to give slightly different results, even for the same compression level. On the other hand, this could even improve storage on the server when configured with a high compression level, as the server side compression can then exploit inter-file redundancies in the zip file (if present), while zip itself compresses all files independently.
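
A minimal Python sketch of the "uncompressed zip" transformation, assuming it runs client-side before upload. The helper name `to_stored` is hypothetical, and the sketch copies only each member's name, timestamp and external attributes.

```python
import io
import zipfile


def to_stored(zip_bytes: bytes) -> bytes:
    """Return a copy of the archive with every member stored uncompressed."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as dst:
        for info in src.infolist():
            # Fresh entry per member so no compression setting carries over.
            stored = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            stored.compress_type = zipfile.ZIP_STORED
            stored.external_attr = info.external_attr
            dst.writestr(stored, src.read(info))
    return out.getvalue()
```

Because the members are stored as-is, their plain bytes appear directly in the archive, which is what gives the server-side delta and compression something to work with.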

Best regards

Markus Schaber


Re: Proposal - option to store unzipped office documents on server side.

Paul Hammant-3


> We got similar benefits at my former employer when we stored JAR files in SVN (archival of old revisions); especially small patches and hotfixes with only a few changed class files were transferred and stored efficiently.
>
> You could also transparently transform the zip into an uncompressed zip file before uploading it, thus getting more benefits from differential storage - at the possible loss of bit-for-bit identity if you try to recompress the zip again on download, as different implementations of the algorithms tend to give slightly different results, even for the same compression level. On the other hand, this could even improve storage on the server when configured with a high compression level, as the server side compression can then exploit inter-file redundancies in the zip file (if present), while zip itself compresses all files independently.

Not related to Svn, but elsewhere I was working towards Maven Jars in Git, too ......

"Github releases as a Maven repo" 

So I have 27 releases of XStream unzipped and pushed to https://github.com/paul-hammant/mc-xs-classes. In terms of size, 8.4M of jars is now 2.4M of bare .git repo. I wrote about 0.01% of XStream back in the day, in case you were interested.

All the jars are still available - here - https://github.com/paul-hammant/mc-xs-classes/releases (the signatures don't match those from Maven Central, but that is not important right now).  Those jars are mere tags in Git. GitHub via their 'releases' feature has done the rest.  I have a Python3 script that can spider a Maven group/artifact and push all the releases to GH quite quickly.

Note: the size of the .git/ folder doesn't change regardless of the order of unzipping jars and pushing their contents to the repo. Git doesn't store deltas, and uses a DEFLATE algorithm for storage. Diffs are meaningless on binary files, of course.

And yes, the idea is that teams can quickly host their own binary deps on GitHub and not bother with Maven Central at all. This might suit experimental things more than mainstream. It also might suit Snapshot releases. I've a blog entry on the general idea of git as a MavenCentral alternate - https://paulhammant.com/2017/05/13/maven-central-as-multiple-git-repos/ and other postings in various Maven mail lists that share code, if you're interested.

-ph
 

Re: Proposal - option to store unzipped office documents on server side.

Philip Martin
Paul Hammant <[hidden email]> writes:

> Git doesn't store deltas, and uses a DEFLATE algorithm for
> storage. Diffs are meaningless on binary files, of course.

I don't know about git but Subversion does quite a good job on some
binary files.  Take the compressed tarballs of a couple of Subversion
tags:

   $ svn export http://svn.apache.org/repos/asf/subversion/tags/1.9.5
   $ svn export http://svn.apache.org/repos/asf/subversion/tags/1.9.4
   $ tar cfz foo1.tar.gz 1.9.5
   $ tar cfz foo2.tar.gz 1.9.4
   $ svnadmin create repo
   $ svnmucc -mm -U file://`pwd`/repo put foo1.tar.gz f.tgz
   $ svnmucc -mm -U file://`pwd`/repo put foo2.tar.gz f.tgz

How big are the tarballs?

   $ ls -lh foo*
   -rw-r--r-- 1 pm pm 15M Aug  4 13:00 foo1.tar.gz
   -rw-r--r-- 1 pm pm 15M Aug  4 13:00 foo2.tar.gz

How big in the repository?

   $ ls -lh repo/db/revs/0/[12]
   -r--r--r-- 1 pm pm 15M Aug  4 13:00 repo/db/revs/0/1
   -r--r--r-- 1 pm pm 13M Aug  4 13:00 repo/db/revs/0/2

Saving about 2M. But we can do better if we do compression knowing that
deltification will be used:

   $ tar cf foo1.tar 1.9.5
   $ tar cf foo2.tar 1.9.4
   $ gzip --rsyncable foo1.tar
   $ gzip --rsyncable foo2.tar

The resulting tarballs are a little bigger:

   -rw-r--r-- 1 pm pm 16M Aug  4 13:05 foo1.tar.gz
   -rw-r--r-- 1 pm pm 16M Aug  4 13:05 foo2.tar.gz

but Subversion can do better deltification:

   -r--r--r-- 1 pm pm  16M Aug  4 13:05 repo/db/revs/0/1
   -r--r--r-- 1 pm pm 5.6M Aug  4 13:05 repo/db/revs/0/2

We have stored two 16MB compressed tarballs in about 21MB of revision files.
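
The idea behind compressing with deltification in mind can be sketched in Python. This is a simplified analogue and not how `gzip --rsyncable` is implemented: it restarts compression at fixed-size blocks (the block size and sample data are made up), so it only copes with same-length changes, not insertions.

```python
import random
import zlib

BLOCK = 4096  # hypothetical block size chosen for the illustration


def compress_blocks(data: bytes) -> list:
    """Compress each fixed-size block independently; no state carries over."""
    return [zlib.compress(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]


# Made-up, reproducible sample data standing in for a tarball.
rng = random.Random(0)
old = "".join(f"line {i} {rng.random()}\n" for i in range(5000)).encode()

# Change a single byte in the middle, keeping the length the same.
mid = len(old) // 2
new = old[:mid] + b"#" + old[mid + 1:]

old_blocks = compress_blocks(old)
new_blocks = compress_blocks(new)
changed = sum(a != b for a, b in zip(old_blocks, new_blocks))
print(f"{changed} of {len(old_blocks)} independently compressed blocks differ")
```

Only the block containing the edit changes, so a delta against the previous revision stays small. In a single compressed stream, by contrast, a change typically alters the compressed bytes that follow it.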

--
Philip

Re: Proposal - option to store unzipped office documents on server side.

Daniel Shahaf-2
In reply to this post by Philip Martin
Philip Martin wrote on Thu, Aug 03, 2017 at 19:01:16 +0100:
> Given unzipped data U there is no single, canonical, compressed form Z
> that represents U.  Instead there are multiple forms Z1, Z2, ... that
> all expand to U.  If the client sends Z1 there is no guarantee that the
> server will be able to recreate Z1 from U, it might produce Z2, Z3, ...
> I suppose a version control system could be designed to allow you to
> commit Z1 and get back any of Z1, Z2, Z3, ... but Subversion makes, and
> assumes, the exact opposite: that you get back exactly what you commit.

One could define U as the repository normal form and Z1, Z2, Z3... as
the translated forms.  Then, the data will be compressed in the
repository (due to svndiff1/svndiff2), on the wire (likewise), and in
the WORKING tree (due to translation).  However, the text bases won't be
compressed, and 'svn cat URL@peg' would give the decompressed form.

And if anybody wanted to PGP-sign Z1, they'd have to exclude it from the
automatic decompression.

(In this context, "translation" is the client-side transformation that
effects svn:eol-style and svn:keywords.)
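
A minimal Python sketch of the normal-form idea, using raw zlib streams in place of zip files. The function names are hypothetical and none of this is Subversion API; it shows only that a checksum taken over the normal form U is the same whichever translated form the client holds.

```python
import hashlib
import zlib


def to_normal_form(working: bytes) -> bytes:
    """Translate a working-tree (compressed) form into the normal form U."""
    return zlib.decompress(working)


def to_working_form(normal: bytes, level: int = 6) -> bytes:
    """Translate the normal form U into some compressed working form."""
    return zlib.compress(normal, level)


u = b"[Content_Types].xml and friends " * 100  # made-up sample data
z1 = to_working_form(u, level=1)
z2 = to_working_form(u, level=9)

# Checksums are computed over the normal form, so Z1 and Z2 agree.
checksums = {hashlib.sha1(to_normal_form(z)).hexdigest() for z in (z1, z2)}
print("distinct normal-form checksums:", len(checksums))
```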

Cheers,

Daniel

> If you break that assumption changes will be necessary all through
> Subversion including, but not limited to, delta transfer, working copy
> storage, checksum verification, client post-commit processing,
> client-server compatibility, server upgrades, etc.
>
> --
> Philip