Skip to content

integration: use NTS by default #292

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 7, 2025

Conversation

muzarski
Copy link
Contributor

@muzarski muzarski commented May 5, 2025

For the tests with single DC clusters, we would always use SimpleStrategy. This PR addresses this, and changes it so we use NetworkTopologyStrategy for Scylla clusters. For Cassandra clusters, we will still use SimpleStrategy - it's because Cassandra 3.11.19 (the version we currently test against) does not allow replication_factor option with NTS.

For SimpleStrategy keyspaces, Scylla disables the tablets by default. When we transition to NTS, some tests start to fail - it is expected - some features are currently not working with tablets. In this PR, we disable the tablets for tests which:

  1. use counters
  2. use LWTs
  3. use materialized views
  4. try to decomission a node (ccm decomission seems to fail for nodes with tablets keyspaces - the exact error is included in corresponding commit's message).

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • PR description sums up the changes and reasons why they should be introduced.
  • [ ] I have implemented Rust unit tests for the features/changes introduced.
  • I have enabled appropriate tests in .github/workflows/build.yml in gtest_filter.
  • I have enabled appropriate tests in .github/workflows/cassandra.yml in gtest_filter.

@muzarski muzarski marked this pull request as draft May 5, 2025 06:26
@muzarski muzarski added this to the 0.5 milestone May 5, 2025
@muzarski muzarski self-assigned this May 5, 2025
@muzarski muzarski added the area/testing Related to unit/integration testing label May 5, 2025
@muzarski muzarski force-pushed the use-nts-by-default branch 3 times, most recently from 9fe9dee to ec98623 Compare May 5, 2025 14:21
@muzarski muzarski requested a review from Lorak-mmk May 5, 2025 15:10
@muzarski muzarski marked this pull request as ready for review May 5, 2025 15:10
@Lorak-mmk
Copy link
Collaborator

Regarding "it: disable tablets for ConsistencyThreeNodeClusterTests ":
What is the replication factor in tests, and how many nodes are there?
Imo the decomission error message suggests that there would be too few nodes left to satisfy replication factor - in which case we need to change the test, or the cluster used in tests.

Comment on lines 271 to 273
// Currently we use Cassandra 3.11.19, which disallows the use of `replication_factor` with NTS.
std::string strategy = Options::is_scylla() ? "'NetworkTopologyStrategy'" : "'SimpleStrategy'";
replication_strategy_s << strategy << ", 'replication_factor': ";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the error message when trying to use replication_factor with Cassandra?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Configuration error: Database returned an error: The query is invalid because of some configuration issue, Error message: replication_factor is an option for SimpleStrategy, not NetworkTopologyStrategy"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Afaik using 'replication_factor': n is the same as specifying this same RF for each DC.
I think in this branch there is only one DC, right? In which case 'dc1': n should work.

Copy link
Contributor Author

@muzarski muzarski May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, but I'm not sure how ccm works in case of creating a cluster with single dc - i.e. I don't know what's the default name for this single dc (if any). I'll need to check.

What I know for sure is that if there are multiple dcs, ccm calls them dc1, dc2, ..., dc_n.

@muzarski
Copy link
Contributor Author

muzarski commented May 6, 2025

Imo the decomission error message suggests that there would be too few nodes left to satisfy replication factor - in which case we need to change the test, or the cluster used in tests.

Assuming this is true (I haven't checked the values you asked for yet), why we can't satisfy repfactor for tablets, but we could do that for vnodes?

@Lorak-mmk
Copy link
Collaborator

Assuming this is true (I haven't checked the values you asked for yet), why we can't satisfy repfactor for tablets, but we could do that for vnodes?

You couldn't, but the DB did not verify that. For example, it was possible to create a keyspace with RF=4 on a 3-node cluster (but there would not be 4 copies of data).
With tablets such verifications are now in place, so the previously wrong behaviors are now rejected explicitly.

@muzarski
Copy link
Contributor Author

muzarski commented May 6, 2025

You couldn't, but the DB did not verify that. For example, it was possible to create a keyspace with RF=4 on a 3-node cluster (but there would not be 4 copies of data). With tablets such verifications are now in place, so the previously wrong behaviors are now rejected explicitly.

Makes sense, thanks. I'll investigate the ConsistencyThreeNodeCluster test suite further then.

muzarski added 5 commits May 7, 2025 09:27
Some features are not working with tablets enabled - for example LWTs.
Tests using such features should explicitly disable tablets (if running
against Scylla that supports tablets).
Almost all schema metadata tests (except for one) do some checks
regarding the materialized view metadata. Let's disable the tablets
for this suite.
I believe this is cleaner - we know which tests are failing by looking at the filter.

This also enables `TableMetadataColumnOrder` test which I implemented a while ago....
I missed it previously.
They are using LWTs statements, which do not work with tablets.
@muzarski
Copy link
Contributor Author

muzarski commented May 7, 2025

Regarding ConsistencyThreeNodeTests: As the name suggests, it starts with 3 nodes cluster.

It looks like this suite explicitly sets the RF to 3. We want to keep RF=3 on the keyspace, decommission some node(s), and expect the errors on higher consistency levels (such as ALL or THREE).

Since Scylla disallows us to decommission the node (which makes sense), we can simply rewrite the test to use ccm stop instead of ccm decommission - this would simulate a scenario when node crashes (and it does not cleanly migrate its data to other nodes, as opposed to ccm decommission).

Example test case (OneNodeDecommissioned):

  // Perform a sanity check against a full healthy cluster (N=3, RF=3)
  insert_.set_consistency(CASS_CONSISTENCY_ALL);
  select_.set_consistency(CASS_CONSISTENCY_ALL);
  session_.execute(insert_);
  session_.execute(select_);

  // Decommission node two
  decommission_node(2);

  // Perform a check using consistency `QUORUM` (N=2, RF=3)
  insert_.set_consistency(CASS_CONSISTENCY_QUORUM);
  select_.set_consistency(CASS_CONSISTENCY_QUORUM);
  session_.execute(insert_);
  session_.execute(select_);

  // Perform a check using consistency `ONE` (N=2, RF=3)
  insert_.set_consistency(CASS_CONSISTENCY_ONE);
  select_.set_consistency(CASS_CONSISTENCY_ONE);
  session_.execute(insert_);
  session_.execute(select_);

  // Perform a check using consistency `TWO` (N=2, RF=3)
  insert_.set_consistency(CASS_CONSISTENCY_TWO);
  select_.set_consistency(CASS_CONSISTENCY_TWO);
  session_.execute(insert_);
  session_.execute(select_);

  // Perform a check using consistency `ALL` (should fail N=2, RF=3)
  insert_.set_consistency(CASS_CONSISTENCY_ALL);
  select_.set_consistency(CASS_CONSISTENCY_ALL);
  ASSERT_NE(CASS_OK, session_.execute(insert_, false).error_code());
  ASSERT_NE(CASS_OK, session_.execute(select_, false).error_code());

  // Perform a check using consistency `THREE` (should fail N=2, RF=3)
  insert_.set_consistency(CASS_CONSISTENCY_THREE);
  select_.set_consistency(CASS_CONSISTENCY_THREE);
  ASSERT_NE(CASS_OK, session_.execute(insert_, false).error_code());
  ASSERT_NE(CASS_OK, session_.execute(select_, false).error_code());

We can simply replace decommission_node() call with stop_node - the test then passes for me locally. WDYT? @Lorak-mmk @wprzytula

@Lorak-mmk
Copy link
Collaborator

Makes sense

@muzarski muzarski force-pushed the use-nts-by-default branch from ec98623 to 8facc9f Compare May 7, 2025 08:24
@muzarski
Copy link
Contributor Author

muzarski commented May 7, 2025

Rebased on master

muzarski added 4 commits May 7, 2025 10:26
This one is interesting - it looks like `ccm decomission` command fails
for Scylla with tablets. I'm including the error message:

```
Subprocess /home/runner/.ccm/cpp-driver_2025-1-2_3-0/node2/bin/scylla nodetool -h 127.0.0.2 -p 10000 decommission exited with non-zero status; exit status: 4;
stderr: error executing POST request to http://127.0.0.2:10000/storage_service/decommission with parameters {}: remote replied with status code 500 Internal Server Error:
std::runtime_error (Decommission failed. See earlier errors (Rolled back: Failed to drain tablets: std::runtime_error (Unable to find new replica for tablet baf48a70-297b-11f0-ba85-f2994cdb5698:1 on 9394531c-5a8b-46f3-b236-db5163ceaccc:1 when draining {9394531c-5a8b-46f3-b236-db5163ceaccc}. Consider adding new nodes or reducing replication factor. (nodes [0ae42828-9d81-4e67-9b61-a3914bc81d5b, 7734cef8-85b4-45f8-87d1-f89ae4c51f39], replicas [7734cef8-85b4-45f8-87d1-f89ae4c51f39:1, 0ae42828-9d81-4e67-9b61-a3914bc81d5b:1, 9394531c-5a8b-46f3-b236-db5163ceaccc:1]))). Request ID: bb5bfa8e-297b-11f0-a230-53d049d48e29)
```

This makes sense, because the test uses 3 nodes, with RF=3 - the decomission
should fail, because otherwise we are not able to satisfy RF. Instead, we can
stop the node so it does not cleanly migrate its data to other nodes (simulating its crash).
It uses counters - they are not supported with tablets.

Note: This also disables tablets for BatchCounterThreeNodeClusterTests,
which use counters as well.
This is basically the same test suite, but with tablets disabled.
Counters are not supported with tablets.
Both test cases use LWTs (though one of them is not enabled yet).
@muzarski muzarski force-pushed the use-nts-by-default branch from 8facc9f to 8c4966c Compare May 7, 2025 08:37
@muzarski
Copy link
Contributor Author

muzarski commented May 7, 2025

v1.1:

  • using ccm stop instead of ccm decommission in ConsistencyThreeNodeClusterTests
  • using NTS for cassandra as well. The default dc name for single-dc cluster is dc1, so we can just do 'class': 'NetworkTopologyStrategy', 'dc1': <rf>

@muzarski muzarski requested a review from Lorak-mmk May 7, 2025 08:44
@wprzytula
Copy link
Collaborator

Hmm...

[ FAILED ] ExecutionProfileTest.Integration_Cassandra_RequestTimeout

@muzarski
Copy link
Contributor Author

muzarski commented May 7, 2025

Hmm...

[ FAILED ] ExecutionProfileTest.Integration_Cassandra_RequestTimeout

      Expected: CASS_ERROR_LIB_REQUEST_TIMED_OUT
      Which is: 16777230
To be equal to: result.error_code()
      Which is: 33563136

Value 335... corresponds to CASS_ERROR_SERVER_INVALID_QUERY. I wonder what exact error server returned..

By default, the tests used SimpleStrategy if there was a single DC.
Using SimpleStrategy is discouraged. Apart from that, Scylla seems to
disable the tablets by default for SimpleStrategy keyspaces - we want
to prevent that.

For Cassandra, we still will be using SimpleStrategy.
@muzarski muzarski force-pushed the use-nts-by-default branch from 8c4966c to 5287fa4 Compare May 7, 2025 09:04
@muzarski
Copy link
Contributor Author

muzarski commented May 7, 2025

Hmm...

[ FAILED ] ExecutionProfileTest.Integration_Cassandra_RequestTimeout

I've run this single test case 100 times in a loop locally against Scylla 5.4.8 - the error did not reproduce.

@muzarski muzarski requested review from Lorak-mmk and wprzytula May 7, 2025 09:23
@muzarski muzarski merged commit ecc3310 into scylladb:master May 7, 2025
12 checks passed
@muzarski muzarski deleted the use-nts-by-default branch May 7, 2025 12:51
@wprzytula wprzytula mentioned this pull request Jun 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/testing Related to unit/integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants