[PATCH] Use the `WITHOUT ROWID` SQLite optimization for rep-cache.db

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH] Use the `WITHOUT ROWID` SQLite optimization for rep-cache.db

Evgeny Kotkov
Hi all,

The recent SQLite versions (starting from 3.8.2, released in December 2013)
feature a `WITHOUT ROWID` optimization [1] that can be enabled when creating
a table.  In short, it works well for tables that have non-integer primary
keys, such as

    name TEXT PRIMARY KEY

by not maintaining the hidden rowid values and an another B-Tree to match
between a primary key value and its rowid.  This reduces the on-disk size
and makes the lookups faster (a key → rowid → data lookup is replaced with
a key → data lookup).

Currently, the rep-cache.db schema uses a non-integer primary key:

    hash TEXT NOT NULL PRIMARY KEY

and can benefit from this optimization.  A quick experiment showed a
reduction of the on-disk size of the database by ~1.75x.  The lookups
should also be faster, both due to the reduced database size and due to
the lesser amount of internal bsearches.  This should improve the times
of new commits and `svnadmin load`, especially for large repositories
that also have large rep-cache.db files.

I think that it would be nice to have this optimization in rep-cache.db,
and that we can start using it in a compatible way:

  - All existing rep-cache.db statements are compatible with it.

  - Since SQLite versions prior to 3.8.2 don't support it, we would
    only create the new tables with this optimization in fsfs format 8,
    and simultaneously bump the minimal required SQLite version from
    3.7.12 (May 2012) to 3.8.2 (December 2013).  This would ensure that
    all binaries supporting format 8 can work with the tables with this
    optimization.

Would there be any objections to a change like this (see the attached patch)?

[1] https://sqlite.org/withoutrowid.html


Thanks,
Evgeny Kotkov

fsfs-rep-cache-without-rowid-v1.patch.txt (11K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Use the `WITHOUT ROWID` SQLite optimization for rep-cache.db

Evgeny Kotkov
Evgeny Kotkov <[hidden email]> writes:

> I think that it would be nice to have this optimization in rep-cache.db,
> and that we can start using it in a compatible way:
>
>   - All existing rep-cache.db statements are compatible with it.
>
>   - Since SQLite versions prior to 3.8.2 don't support it, we would
>     only create the new tables with this optimization in fsfs format 8,
>     and simultaneously bump the minimal required SQLite version from
>     3.7.12 (May 2012) to 3.8.2 (December 2013).  This would ensure that
>     all binaries supporting format 8 can work with the tables with this
>     optimization.
>
> Would there be any objections to a change like this (see the attached patch)?

I committed the patch (with a couple of additional tweaks for the
recommended SQLite version and the build scripts) in

    https://svn.apache.org/r1817042


Thanks,
Evgeny Kotkov
Reply | Threaded
Open this post in threaded view
|

RE: [PATCH] Use the `WITHOUT ROWID` SQLite optimization for rep-cache.db

Bert Huijben-5
In reply to this post by Evgeny Kotkov


> -----Original Message-----
> From: Evgeny Kotkov [mailto:[hidden email]]
> Sent: donderdag 30 november 2017 17:45
> To: [hidden email]
> Subject: [PATCH] Use the `WITHOUT ROWID` SQLite optimization for rep-
> cache.db
>
> Hi all,
>
> The recent SQLite versions (starting from 3.8.2, released in December 2013)
> feature a `WITHOUT ROWID` optimization [1] that can be enabled when
> creating
> a table.  In short, it works well for tables that have non-integer primary
> keys, such as
>
>     name TEXT PRIMARY KEY
>
> by not maintaining the hidden rowid values and an another B-Tree to match
> between a primary key value and its rowid.  This reduces the on-disk size
> and makes the lookups faster (a key → rowid → data lookup is replaced with
> a key → data lookup).

It doesn't add another B-tree for the primary key and its rowids. For the primary key the main table is used as the index.

The case where things differ is when there are multiple indexes. In this case normal table will always refer to the primary key using the rowed, while for 'WITHOUT ROWID' there will be referred to the primary key, which in general is larger.

It really depends on the use case where this helps... or makes things worse... Certain optimizations inside SQLite are not available for such tables.


For this specific table I think this helps as we only use the primary key index, but I'm guessing it won't help us much on other tables.


When we bump the required SQLite version for the client we might want to update the scheme of wc.db to use a sparse index for the move info table. This index contains mostly NULL values, which SQLite can easily stop maintaining, improving write("UPDATE") speed on the NODES tables considerably.

        Bert

>
> Currently, the rep-cache.db schema uses a non-integer primary key:
>
>     hash TEXT NOT NULL PRIMARY KEY
>
> and can benefit from this optimization.  A quick experiment showed a
> reduction of the on-disk size of the database by ~1.75x.  The lookups
> should also be faster, both due to the reduced database size and due to
> the lesser amount of internal bsearches.  This should improve the times
> of new commits and `svnadmin load`, especially for large repositories
> that also have large rep-cache.db files.
>
> I think that it would be nice to have this optimization in rep-cache.db,
> and that we can start using it in a compatible way:
>
>   - All existing rep-cache.db statements are compatible with it.
>
>   - Since SQLite versions prior to 3.8.2 don't support it, we would
>     only create the new tables with this optimization in fsfs format 8,
>     and simultaneously bump the minimal required SQLite version from
>     3.7.12 (May 2012) to 3.8.2 (December 2013).  This would ensure that
>     all binaries supporting format 8 can work with the tables with this
>     optimization.
>
> Would there be any objections to a change like this (see the attached
> patch)?
>
> [1] https://sqlite.org/withoutrowid.html
>
>
> Thanks,
> Evgeny Kotkov

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Use the `WITHOUT ROWID` SQLite optimization for rep-cache.db

Evgeny Kotkov
Bert Huijben <[hidden email]> writes:

>> The recent SQLite versions (starting from 3.8.2, released in December 2013)
>> feature a `WITHOUT ROWID` optimization [1] that can be enabled when
>> creating a table.  In short, it works well for tables that have non-integer
>> primary keys, such as
>>
>>     name TEXT PRIMARY KEY
>>
>> by not maintaining the hidden rowid values and an another B-Tree to match
>> between a primary key value and its rowid.  This reduces the on-disk size
>> and makes the lookups faster (a key → rowid → data lookup is replaced with
>> a key → data lookup).
>
> It doesn't add another B-tree for the primary key and its rowids. For the
> primary key the main table is used as the index.
>
> The case where things differ is when there are multiple indexes. In this
> case normal table will always refer to the primary key using the rowed,
> while for 'WITHOUT ROWID' there will be referred to the primary key,
> which in general is larger.

For the sake of finding out the truth :), I think that this contradicts the
explanation in https://sqlite.org/withoutrowid.html :

    CREATE TABLE IF NOT EXISTS wordcount(
      word TEXT PRIMARY KEY,
      cnt INTEGER
    );

    As an ordinary SQLite table, "wordcount" is implemented as two separate
    B-Trees. The main table uses the hidden rowid value as the key and stores
    the "word" and "cnt" columns as data. The "TEXT PRIMARY KEY" phrase of
    the CREATE TABLE statement causes the creation of an unique index on
    the "word" column. This index is a separate B-Tree that uses "word" and
    the "rowid" as the key and stores no data at all. Note that the
    complete text of every "word" is stored twice: once in the main
    table and again in the index.

Although I didn't check if the most recent SQLite version still behaves in
the described way internally, I have witnessed the described close-to-2x
reduction in the size of rep-cache.db — which, unless I am missing
something, follows the described idea of this optimization.


Thanks,
Evgeny Kotkov
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Use the `WITHOUT ROWID` SQLite optimization for rep-cache.db

Branko Čibej
On 08.12.2017 12:40, Evgeny Kotkov wrote:

> Bert Huijben <[hidden email]> writes:
>
>>> The recent SQLite versions (starting from 3.8.2, released in December 2013)
>>> feature a `WITHOUT ROWID` optimization [1] that can be enabled when
>>> creating a table.  In short, it works well for tables that have non-integer
>>> primary keys, such as
>>>
>>>     name TEXT PRIMARY KEY
>>>
>>> by not maintaining the hidden rowid values and an another B-Tree to match
>>> between a primary key value and its rowid.  This reduces the on-disk size
>>> and makes the lookups faster (a key → rowid → data lookup is replaced with
>>> a key → data lookup).
>> It doesn't add another B-tree for the primary key and its rowids. For the
>> primary key the main table is used as the index.
>>
>> The case where things differ is when there are multiple indexes. In this
>> case normal table will always refer to the primary key using the rowed,
>> while for 'WITHOUT ROWID' there will be referred to the primary key,
>> which in general is larger.
> For the sake of finding out the truth :), I think that this contradicts the
> explanation in https://sqlite.org/withoutrowid.html :
>
>     CREATE TABLE IF NOT EXISTS wordcount(
>       word TEXT PRIMARY KEY,
>       cnt INTEGER
>     );
>
>     As an ordinary SQLite table, "wordcount" is implemented as two separate
>     B-Trees. The main table uses the hidden rowid value as the key and stores
>     the "word" and "cnt" columns as data. The "TEXT PRIMARY KEY" phrase of
>     the CREATE TABLE statement causes the creation of an unique index on
>     the "word" column. This index is a separate B-Tree that uses "word" and
>     the "rowid" as the key and stores no data at all. Note that the
>     complete text of every "word" is stored twice: once in the main
>     table and again in the index.
>
> Although I didn't check if the most recent SQLite version still behaves in
> the described way internally, I have witnessed the described close-to-2x
> reduction in the size of rep-cache.db — which, unless I am missing
> something, follows the described idea of this optimization.

I suspect most of the performance improvement is a consequence of the
reduced database size, since it needs less page-in operations for the
same number of lookups. The B-tree search should really be trivial
compared to uncached I/O.

-- Brane