verify_as_revision_before_current_plus_plus() on a production repo?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

verify_as_revision_before_current_plus_plus() on a production repo?

Julian Foad-5
Would this be safe?

In subversion/libsvn_fs_fs/transaction.c, just before commit_body()
bumps the revision number in the "current" file, it calls
verify_as_revision_before_current_plus_plus(), quoted below.

verify_as_revision_before_current_plus_plus() is currently compiled in
to debug builds but not to release builds. We can say therefore it gets
reasonable coverage in test suite runs but has had little or no
real-world testing.

In a customer's production system, two recent and very similar commits
each resulted in a corruption ("Checksum mismatch while reading
representation...") being reported in post-commit processing in
dav_svn__merge_response().

I want to recommend that the customer rebuilds their Subversion (1.9.4)
release build with this "verify" code enabled, and runs it in their
production server, in order to try to prevent further corruptions from
entering the repository. It seems very possible that this may block such
a broken commit if and when it happens again.

Please could anybody cast a second pair of eyes over this code and say
whether it looks safe to enable in production. It looks low risk to me,
but it is a little "tricky": in particular, it opens a second FS
instance and fakes the "youngest revision seen" field and then relies on
this FS instance never reading the "current" file nor using anything
that would have been done post-commit. It also doesn't pass any of the
configured FS options to this second instance, which looks OK for the
time being but not future-proof.

Thanks...
- Julian


[[[
/* Open a new svn_fs_t handle to FS, set that handle's concept of
    "current youngest revision" to NEW_REV, and call
    svn_fs_fs__verify_root() on NEW_REV's revision root.

    Intended to be called as the very last step in a commit before
    'current' is bumped.  This implies that we are holding the write
    lock. */
static svn_error_t *
verify_as_revision_before_current_plus_plus(svn_fs_t *fs,
                                             svn_revnum_t new_rev,
                                             apr_pool_t *pool)
{
#ifdef SVN_DEBUG
   fs_fs_data_t *ffd = fs->fsap_data;
   svn_fs_t *ft; /* fs++ == ft */
   svn_fs_root_t *root;
   fs_fs_data_t *ft_ffd;
   apr_hash_t *fs_config;

   SVN_ERR_ASSERT(ffd->svn_fs_open_);

   /* make sure FT does not simply return data cached by other instances
    * but actually retrieves it from disk at least once.
    */
   fs_config = apr_hash_make(pool);
   svn_hash_sets(fs_config, SVN_FS_CONFIG_FSFS_CACHE_NS,
                            svn_uuid_generate(pool));
   SVN_ERR(ffd->svn_fs_open_(&ft, fs->path,
                             fs_config,
                             pool,
                             pool));
   ft_ffd = ft->fsap_data;
   /* Don't let FT consult rep-cache.db, either. */
   ft_ffd->rep_sharing_allowed = FALSE;

   /* Time travel! */
   ft_ffd->youngest_rev_cache = new_rev;

   SVN_ERR(svn_fs_fs__revision_root(&root, ft, new_rev, pool));
   SVN_ERR_ASSERT(root->is_txn_root == FALSE && root->rev == new_rev);
   SVN_ERR_ASSERT(ft_ffd->youngest_rev_cache == new_rev);
   SVN_ERR(svn_fs_fs__verify_root(root, pool));
#endif /* SVN_DEBUG */

   return SVN_NO_ERROR;
}

[...]

commit_body(void *baton, apr_pool_t *pool)
{
   [...]
   [... Move the finished rev file into place]
   [... Write final revprops file]

   /* Update the 'current' file. */
   SVN_ERR(verify_as_revision_before_current_plus_plus(cb->fs, new_rev,
pool));
   SVN_ERR(write_final_current(cb->fs, txn_id, new_rev, start_node_id,
                               start_copy_id, pool));
   [...]
}
]]]
Reply | Threaded
Open this post in threaded view
|

Re: verify_as_revision_before_current_plus_plus() on a production repo?

Daniel Shahaf-2
Julian Foad wrote on Mon, Jan 30, 2017 at 21:40:00 +0000:
> Please could anybody cast a second pair of eyes over this code and say
> whether it looks safe to enable in production. It looks low risk to me, but
> it is a little "tricky": in particular, it opens a second FS instance and
> fakes the "youngest revision seen" field and then relies on this FS instance
> never reading the "current" file

Right: if some code refreshes youngest_rev_cache (by open()ing
'current', reading a value from it, and setting youngest_rev_cache to
that value), that will cause root->rev to be "newer than youngest".

I think the 'verify' code already has to deal with this possibility,
since 'recover' can backdate 'current' in the following situation:
.
    1. svn_fs.h consumer opens an svn_fs_root_t for r42
    2. invisible monkeys delete db/revs/0/42 and db/current
    3. admin runs 'svnadmin recover', which regenerates 'current' as 41
    4. svn_fs.h consumer calls verify() on the root it had opened earlier

Moreover, the 'verify' code is inherently the place where violated
invariants are least likely to cause trouble, and it's read-only.
Therefore, while a bug might cause a false positive verification error
that rejects a commit, I don't see any worse outcome.  (If there's
a failure mode here that I overlooked, it's most likely to be in the f7
code, since I haven't worked much with those parts of fsfs.)

All that said, I agree that checking after the _verify_root() call that
root->rev and youngest_rev_cache haven't changed would be an improvement.

(Historical note: I think the function was written on the assumption
that youngest_rev_cache is advanced by assigning MAX(youngest_rev_cache,
value_read_from_disk) to it… but that assumption, regardless of whether
it was true when the function was written, is not true today; nowadays
the code just assigns "youngest_rev_cache = value_read_from_disk"
unconditionally.)

> nor using anything that would have been done post-commit.

That's a good point: the svn_fs_fs__verify_root() must not add any
permanent references to the revision it thinks is youngest — e.g., it
must not add reps it traverses to rep-cache.db.  That's true today but
not necessarily true forever.

>

I also wonder if having verify_as_revision_before_current_plus_plus()
run in a child process would gain anything.

> verify_as_revision_before_current_plus_plus() is currently compiled in to
> debug builds but not to release builds. We can say therefore it gets
> reasonable coverage in test suite runs but has had little or no real-world
> testing.

Indeed.  How about enabling that function in the alpha1 release so we
can get some more feedback about it?

(CC'ing stsp)

Cheers,

Daniel
Reply | Threaded
Open this post in threaded view
|

Re: verify_as_revision_before_current_plus_plus() on a production repo?

Julian Foad-5
Daniel, thanks for your comments.

Daniel Shahaf wrote:

> Julian Foad wrote on Mon, Jan 30, 2017 at 21:40:00 +0000:
>> it is a little "tricky": in particular, it opens a second FS instance and
>> fakes the "youngest revision seen" field and then relies on this FS instance
>> never reading the "current" file
>
> Right: if some code refreshes youngest_rev_cache (by open()ing
> 'current', reading a value from it, and setting youngest_rev_cache to
> that value), that will cause root->rev to be "newer than youngest".
>
> I think the 'verify' code already has to deal with this possibility,
> since 'recover' can backdate 'current' in the following situation:
> .
>     1. svn_fs.h consumer opens an svn_fs_root_t for r42
>     2. invisible monkeys delete db/revs/0/42 and db/current
>     3. admin runs 'svnadmin recover', which regenerates 'current' as 41
>     4. svn_fs.h consumer calls verify() on the root it had opened earlier

I don't buy this. verify_root() is not required to successfully verify
r42 in this scenario.

> Moreover, the 'verify' code is inherently the place where violated
> invariants are least likely to cause trouble, and it's read-only.
> Therefore, while a bug might cause a false positive verification error
> that rejects a commit, I don't see any worse outcome.  (If there's
> a failure mode here that I overlooked, it's most likely to be in the f7
> code, since I haven't worked much with those parts of fsfs.)
>
> All that said, I agree that checking after the _verify_root() call that
> root->rev and youngest_rev_cache haven't changed would be an improvement.
[...]
>
>> nor using anything that would have been done post-commit.
>
> That's a good point: the svn_fs_fs__verify_root() must not add any
> permanent references to the revision it thinks is youngest — e.g., it
> must not add reps it traverses to rep-cache.db.  That's true today but
> not necessarily true forever.

Interesting observation. It would be good the verify could use a
read-only FS instance, but I don't think we have such a mechanism.

I meant the opposite direction: a successful commit does some things
post-commit (e.g. remove the txn directory, update the fulltext cache,
update the rep cache) and this verify must not assume any of those
things have been done already.

> I also wonder if having verify_as_revision_before_current_plus_plus()
> run in a child process would gain anything.
>
>> verify_as_revision_before_current_plus_plus() is currently compiled in to
>> debug builds but not to release builds. We can say therefore it gets
>> reasonable coverage in test suite runs but has had little or no real-world
>> testing.
>
> Indeed.  How about enabling that function in the alpha1 release so we
> can get some more feedback about it?

I like that idea: it seems like entirely appropriate behaviour for an
alpha or beta release, and we'd probably get no direct feedback and this
would be a good sign that it's working ok.

- Julian

Reply | Threaded
Open this post in threaded view
|

Re: verify_as_revision_before_current_plus_plus() on a production repo?

Daniel Shahaf-2
Julian Foad wrote on Tue, Jan 31, 2017 at 10:22:06 +0000:

> Daniel, thanks for your comments.
>
> Daniel Shahaf wrote:
> >Julian Foad wrote on Mon, Jan 30, 2017 at 21:40:00 +0000:
> >>it is a little "tricky": in particular, it opens a second FS instance and
> >>fakes the "youngest revision seen" field and then relies on this FS instance
> >>never reading the "current" file
> >
> >Right: if some code refreshes youngest_rev_cache (by open()ing
> >'current', reading a value from it, and setting youngest_rev_cache to
> >that value), that will cause root->rev to be "newer than youngest".
> >
> >I think the 'verify' code already has to deal with this possibility,
> >since 'recover' can backdate 'current' in the following situation:
> >.
> >    1. svn_fs.h consumer opens an svn_fs_root_t for r42
> >    2. invisible monkeys delete db/revs/0/42 and db/current
> >    3. admin runs 'svnadmin recover', which regenerates 'current' as 41
> >    4. svn_fs.h consumer calls verify() on the root it had opened earlier
>
> I don't buy this. verify_root() is not required to successfully verify r42
> in this scenario.

I agree with your last sentence, but I think you misunderstood me.
I was simply arguing that the "root->rev is younger than value in
'current' on disk" is a situation that can happen in normal API usage,
and therefore calling svn_fs_verify_root() in that situation is *not*
undefined behaviour.  That is: the call may either succeed or error out,
but may not assert or segfault (or grow the proverbial toaster's arm).

> >Moreover, the 'verify' code is inherently the place where violated
> >invariants are least likely to cause trouble, and it's read-only.
> >Therefore, while a bug might cause a false positive verification error
> >that rejects a commit, I don't see any worse outcome.  (If there's
> >a failure mode here that I overlooked, it's most likely to be in the f7
> >code, since I haven't worked much with those parts of fsfs.)
> >
> >All that said, I agree that checking after the _verify_root() call that
> >root->rev and youngest_rev_cache haven't changed would be an improvement.
> [...]
> >
> >>nor using anything that would have been done post-commit.
> >
> >That's a good point: the svn_fs_fs__verify_root() must not add any
> >permanent references to the revision it thinks is youngest — e.g., it
> >must not add reps it traverses to rep-cache.db.  That's true today but
> >not necessarily true forever.
>
> Interesting observation. It would be good the verify could use a read-only
> FS instance, but I don't think we have such a mechanism.

Not today, but it would be easy enough to add to fs_fs_data_t a boolean
that tells with_some_lock() to just error out, and to have rep-cache.c
respect that boolean as well.  (See also the next bullet)

> I meant the opposite direction: a successful commit does some things
> post-commit (e.g. remove the txn directory, update the fulltext cache,
> update the rep cache) and this verify must not assume any of those things
> have been done already.

Taking these one by one:

- remove the txn directory: this wouldn't happen if the committing
  process is SIGKILLed just after bumping 'current', so 'verify' already
  has to deal with this.

  In theory, there could be a bug the other way here: if verify()
  removed that "orphaned" txn dir, the original commit process — which
  is "on hold" pending verify() returning — might try to rmdir() that
  directory when it resumes, and fail with ENOENT.  This specific
  situation would have been caught by 'make check' in maintainer mode,
  of course, but in general, that's another way in which we rely on
  'verify' being a read-only operation.

- update the fulltext cache: the second FS handle already runs with
  empty caches (see the svn_uuid_generate() call).

- update the rep cache: verify_as_revision_before_current_plus_plus() is
  called before rep-cache.db is updated (bumping 'current' is
  a fence separating those two), and in any case, 'verify' already has
  to deal with rep-cache entries being absent in case a revision had
  been committed while enable-rep-caching=false (in fsfs.conf) was in
  effect.

So that's these three.  Is there anything else that's updated between
bumping 'current' and returning control to svn_fs_commit_txn()'s caller?
Or some kind of process-global or machine-global cache shared between
the two FS handles, despite the different cache namespaces?

> >I also wonder if having verify_as_revision_before_current_plus_plus()
> >run in a child process would gain anything.
> >
> >>verify_as_revision_before_current_plus_plus() is currently compiled in to
> >>debug builds but not to release builds. We can say therefore it gets
> >>reasonable coverage in test suite runs but has had little or no real-world
> >>testing.
> >
> >Indeed.  How about enabling that function in the alpha1 release so we
> >can get some more feedback about it?
>
> I like that idea: it seems like entirely appropriate behaviour for an alpha
> or beta release, and we'd probably get no direct feedback and this would be
> a good sign that it's working ok.

Exactly my thinking.

Cheers,

Daniel
Reply | Threaded
Open this post in threaded view
|

Re: verify_as_revision_before_current_plus_plus() on a production repo?

Julian Foad-5
Daniel Shahaf wrote:
> I agree with your last sentence, but I think you misunderstood me.
> I was simply arguing that the "root->rev is younger than value in
> 'current' on disk" is a situation that can happen in normal API usage,
> and therefore calling svn_fs_verify_root() in that situation is *not*
> undefined behaviour.  That is: the call may either succeed or error out,
> but may not assert or segfault (or grow the proverbial toaster's arm).

OK. To run this in production, it also must not error out in a normal
commit. I haven't fully traced the verify code to see whether it might
call svn_fs_fs__read_current().

> Julian Foad wrote:
>> Interesting observation. It would be good the verify could use a read-only
>> FS instance, but I don't think we have such a mechanism.
>
> Not today, but it would be easy enough to add to fs_fs_data_t a boolean
> that tells with_some_lock() to just error out, and to have rep-cache.c
> respect that boolean as well.  (See also the next bullet)

Maybe we should propose this as an enhancement.

>> I meant the opposite direction: a successful commit does some things
>> post-commit (e.g. remove the txn directory, update the fulltext cache,
>> update the rep cache) and this verify must not assume any of those things
>> have been done already.
>
> Taking these one by one:
[...]
> So that's these three.  Is there anything else that's updated between
> bumping 'current' and returning control to svn_fs_commit_txn()'s caller?

No.

> Or some kind of process-global or machine-global cache shared between
> the two FS handles, despite the different cache namespaces?

Not sure.

Nevertheless everything we discussed here gives me some confidence. Thanks.

- Julian

Reply | Threaded
Open this post in threaded view
|

Re: verify_as_revision_before_current_plus_plus() on a production repo?

Julian Foad-5
We seem to have consensus that enabling

   verify_as_revision_before_current_plus_plus()

in production should be safe. It seems like it should be a useful extra
protection against bug-induced repository corruption. For users who have
experienced repository corruption and suspect it is bug-induced, or who
are more concerned about this than about maximizing commit throughput,
enabling such code would seem to be prudent.

We have long suggested running 'svnadmin verify' on each committed
revision. Is it because of a policy that we currently ship a
configuration which by default doesn't do this, favouring speed at the
expense of consistency checking? Or, when such checking is technically
available, should our policy be to provide administrative choice?

As I mentioned, one WANdisco customer recently experienced bug-induced
corruption twice in quick succession, and we were unable to isolate the
cause. I advised WANdisco to make a special build, enabling this, for
this customer. However, it is unsatisfactory to ship different builds
depending on whether a customer currently prefers more speed or more
reliability.

While I don't support adding too many config knobs, a choice like this
is something that the software cannot reasonably make by itself, so this
is a case where a new knob is justified, in my opinion.

Alternatively, given that Subversion's number one priority is supposed
to be reliability, why would we not enable this verification
permanently? The argument would, I suppose, be that it is unnecessary.
Yet we don't have convincing data to support that. Does anyone support
enabling this permanently? I ask mainly to make us think about it. The
easy answer is we don't want to reduce commit speed this much (how
much?) without evidence of need.

Another reason to make it configurable is that then if a user finds
corruption, like this customer did, they could quickly enable it and
repeat a similar commit and discover whether it does in fact protect
against that use case. In the case of the customer mentioned above, we
lost the opportunity to do that because they had changed their operating
procedures and moved on for weeks before we were able to get a build
with this option enabled.

Thoughts, please?

- Julian


Reply | Threaded
Open this post in threaded view
|

Re: verify_as_revision_before_current_plus_plus() on a production repo?

Daniel Shahaf-2
Julian Foad wrote on Mon, Mar 13, 2017 at 11:27:32 +0000:
> We seem to have consensus that enabling
>
>   verify_as_revision_before_current_plus_plus()
>
> in production should be safe. It seems like it should be a useful extra
> protection against bug-induced repository corruption.

I think we're as confident about this code as we can be about code that
hasn't been used in production.

> For users who have experienced repository corruption and suspect it is
> bug-induced, or who are more concerned about this than about
> maximizing commit throughput, enabling such code would seem to be
> prudent.

+1

> should our policy be to provide administrative choice?

> Thoughts, please?

A knob sounds good to me: both because this is a new codepath (the FS
layer should be conservative) and because we don't know everyone's use
patterns.

The usual problem with knobs is that they mandate maintaining multiple
codepaths; this consideration does not apply to a knob that simply
enables/disables a sanity check.

If we enable this outside maintainer mode, we should look into the
points raised upthread about ensuring that 'verify' is read-only.
We don't have to fix everything, but we should at least add roadsign
comments in the code to reduce the chance that future changes to verify
will interact badly with the unusual callsite in verify_as_revision_before_current_plus_plus().

> - Julian

By the way, are you planning to enable verify_as_revision_before_current_plus_plus()
in alpha3?  I would suggest enabling it unconditionally before alpha3,
then reverting it [or adding the knob under discussion] at some point
before branching 1.10.x.

Cheers,

Daniel
[We should probably give that function a shorter name... ;)]
Reply | Threaded
Open this post in threaded view
|

[PATCH] Re: verify_as_revision_before_current_plus_plus() on a production repo?

Julian Foad-5
Daniel Shahaf wrote:
>> should our policy be to provide administrative choice?
>
> A knob sounds good to me [...]

And Bert said "+1" on IRC too.

The attached patch is a basic implementation. The fsfs.conf option is
spelled

   [debug]
   verify-before-commit = true

I put it under [debug] because that section name already exists and
holds a "pack after commit" option (not documented in the file template
comments) which seems conceptually similar.

What sort of testing do you think this needs? A first level could be to
test that the verification code runs when the option is explicitly
enabled and doesn't when disabled, and preferably test the default state
too. A second level could be to test that the verification actually
picks up some (injected) corruption and prevents the commit and exits
cleanly. This second level is not directly related to making the feature
optional, but it is related to providing the feature at all in a
release-mode build.


> If we enable this outside maintainer mode, we should look into the
> points raised upthread about ensuring that 'verify' is read-only.
> We don't have to fix everything, but we should at least add roadsign
> comments in the code to reduce the chance that future changes to verify
> will interact badly with the unusual callsite in verify_as_revision_before_current_plus_plus().

Ack.

> By the way, are you planning to enable verify_as_revision_before_current_plus_plus()
> in alpha3?  I would suggest enabling it unconditionally before alpha3,
> then reverting it [or adding the knob under discussion] at some point
> before branching 1.10.x.

Sounds good. To keep track of the need to change the default back again,
would a bold coloured box in the release notes would be a suitable place
for such a note?

> [We should probably give that function a shorter name... ;)]

Could do. verify_before_commit?

- Julian

verify-before-commit-option-2.patch (7K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Re: verify_as_revision_before_current_plus_plus() on a production repo?

Daniel Shahaf-2
Julian Foad wrote on Thu, Mar 16, 2017 at 11:50:34 +0000:
> What sort of testing do you think this needs? A first level could be to test
> that the verification code runs when the option is explicitly enabled and
> doesn't when disabled, and preferably test the default state too. A second
> level could be to test that the verification actually picks up some
> (injected) corruption and prevents the commit and exits cleanly. This second
> level is not directly related to making the feature optional, but it is
> related to providing the feature at all in a release-mode build.

Another thing we could test: 'make check' with the knob enabled.  I'd
lean to having that use 'svnserve -T' or a threaded MPM, in order to
increase the chances of catching possible bugs related to process-global
state being inappropriately modified by the "Time travel" svn_fs_t
handle.

> Daniel Shahaf wrote:
> >By the way, are you planning to enable verify_as_revision_before_current_plus_plus()
> >in alpha3?  I would suggest enabling it unconditionally before alpha3,
> >then reverting it [or adding the knob under discussion] at some point
> >before branching 1.10.x.
>
> Sounds good. To keep track of the need to change the default back again,
> would a bold coloured box in the release notes would be a suitable place for
> such a note?

I was thinking of an issue with milestone 1.10.0, but anything we'll run
into before cutting rc1 would be fine.

> >[We should probably give that function a shorter name... ;)]
>
> Could do. verify_before_commit?

Sure.

> @@ -1017,12 +1028,19 @@ write_config(svn_fs_t *fs,
> +""                                                                           NL
> +"[" CONFIG_SECTION_DEBUG "]"                                                 NL
> +"###"                                                                        NL
> +"### Whether to verify each new revision immediately before finalizing"      NL
> +"### the commit. The default is false in release-mode builds, and true"      NL
> +"### in debug-mode builds."                                                  NL
> +"# " CONFIG_OPTION_VERIFY_BEFORE_COMMIT " = false"                           NL

The phrase "debug-mode builds" isn't quite accurate: on Unix, SVN_DEBUG
is enabled by --enable-maintainer-mode but not by --enable-debug.
(On windows, there's no distinction between maintainer and debug modes,
and SVN_DEBUG is defined in that case.)

I'm not sure whether we should change the comment to match the code, or
vice-versa.

Cheers,

Daniel
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Re: verify_as_revision_before_current_plus_plus() on a production repo?

Stefan Fuhrmann-3
In reply to this post by Julian Foad-5
On 16.03.2017 12:50, Julian Foad wrote:

> Daniel Shahaf wrote:
>>> should our policy be to provide administrative choice?
>>
>> A knob sounds good to me [...]
>
> And Bert said "+1" on IRC too.
>
> The attached patch is a basic implementation. The fsfs.conf option is
> spelled
>
>   [debug]
>   verify-before-commit = true
>
> I put it under [debug] because that section name already exists and
> holds a "pack after commit" option (not documented in the file
> template comments) which seems conceptually similar.
>
> What sort of testing do you think this needs? A first level could be
> to test that the verification code runs when the option is explicitly
> enabled and doesn't when disabled, and preferably test the default
> state too. A second level could be to test that the verification
> actually picks up some (injected) corruption and prevents the commit
> and exits cleanly. This second level is not directly related to making
> the feature optional, but it is related to providing the feature at
> all in a release-mode build.
>
>
>> If we enable this outside maintainer mode, we should look into the
>> points raised upthread about ensuring that 'verify' is read-only.
>> We don't have to fix everything, but we should at least add roadsign
>> comments in the code to reduce the chance that future changes to verify
>> will interact badly with the unusual callsite in
>> verify_as_revision_before_current_plus_plus().
>
> Ack.
>
>> By the way, are you planning to enable
>> verify_as_revision_before_current_plus_plus()
>> in alpha3?  I would suggest enabling it unconditionally before alpha3,
>> then reverting it [or adding the knob under discussion] at some point
>> before branching 1.10.x.
>
> Sounds good. To keep track of the need to change the default back
> again, would a bold coloured box in the release notes would be a
> suitable place for such a note?
>
>> [We should probably give that function a shorter name... ;)]
>
> Could do. verify_before_commit?

I applied the patch, including the rename, in r1795351.

It adds 50% to our test suite (2:10 instead of 1:26).
If we were to enable it by default in release mode,
we need to add a feature to disable it for mass ops
like 'svnadmin import'.

-- Stefan^2.