Skip to content

Commit

Permalink
Add missing MongoDB sharding configuration for version vectors (#1097)
Browse files Browse the repository at this point in the history
This commit adds the missing MongoDB sharding configuration for
versionvectors collection introduced in PR #1047, as well as updates
the related documentation to reflect these changes.
  • Loading branch information
hackerwins authored Dec 11, 2024
1 parent e3045dc commit c9a86db
Show file tree
Hide file tree
Showing 4 changed files with 61 additions and 41 deletions.
5 changes: 5 additions & 0 deletions build/charts/yorkie-cluster/charts/yorkie-mongodb/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,11 @@ sharded:
- name: doc_id
method: "1"
unique: false
- collectionName: versionvectors
shardKeys:
- name: doc_id
method: "1"
unique: false

# Configuration for manual dmongodb sharded stack
mongodb-sharded:
Expand Down
59 changes: 33 additions & 26 deletions build/docker/sharding/scripts/init-mongos1.js
Original file line number Diff line number Diff line change
@@ -1,39 +1,46 @@
sh.addShard("shard-rs-1/shard1-1:27017")
sh.addShard("shard-rs-2/shard2-1:27017")
sh.addShard("shard-rs-1/shard1-1:27017");
sh.addShard("shard-rs-2/shard2-1:27017");

function findAnotherShard(shard) {
if (shard == "shard-rs-1") {
return "shard-rs-2"
} else {
return "shard-rs-1"
}
if (shard == "shard-rs-1") {
return "shard-rs-2";
} else {
return "shard-rs-1";
}
}

function shardOfChunk(minKeyOfChunk) {
return db.getSiblingDB("config").chunks.findOne({ min: { project_id: minKeyOfChunk } }).shard
return db
.getSiblingDB("config")
.chunks.findOne({ min: { project_id: minKeyOfChunk } }).shard;
}

// Shard the database for the mongo client test
const mongoClientDB = "test-yorkie-meta-mongo-client"
sh.enableSharding(mongoClientDB)
sh.shardCollection(mongoClientDB + ".clients", { project_id: 1 })
sh.shardCollection(mongoClientDB + ".documents", { project_id: 1 })
sh.shardCollection(mongoClientDB + ".changes", { doc_id: 1 })
sh.shardCollection(mongoClientDB + ".snapshots", { doc_id: 1 })
sh.shardCollection(mongoClientDB + ".syncedseqs", { doc_id: 1 })
const mongoClientDB = "test-yorkie-meta-mongo-client";
sh.enableSharding(mongoClientDB);
sh.shardCollection(mongoClientDB + ".clients", { project_id: 1 });
sh.shardCollection(mongoClientDB + ".documents", { project_id: 1 });
sh.shardCollection(mongoClientDB + ".changes", { doc_id: 1 });
sh.shardCollection(mongoClientDB + ".snapshots", { doc_id: 1 });
sh.shardCollection(mongoClientDB + ".syncedseqs", { doc_id: 1 });
sh.shardCollection(mongoClientDB + ".versionvectors", { doc_id: 1 });

// Split the inital range at `splitPoint` to allow doc_ids duplicate in different shards.
const splitPoint = ObjectId("500000000000000000000000")
sh.splitAt(mongoClientDB + ".documents", { project_id: splitPoint })
const splitPoint = ObjectId("500000000000000000000000");
sh.splitAt(mongoClientDB + ".documents", { project_id: splitPoint });
// Move the chunk to another shard.
db.adminCommand({ moveChunk: mongoClientDB + ".documents", find: { project_id: splitPoint }, to: findAnotherShard(shardOfChunk(splitPoint)) })
db.adminCommand({
moveChunk: mongoClientDB + ".documents",
find: { project_id: splitPoint },
to: findAnotherShard(shardOfChunk(splitPoint)),
});

// Shard the database for the server test
const serverDB = "test-yorkie-meta-server"
sh.enableSharding(serverDB)
sh.shardCollection(serverDB + ".clients", { project_id: 1 })
sh.shardCollection(serverDB + ".documents", { project_id: 1 })
sh.shardCollection(serverDB + ".changes", { doc_id: 1 })
sh.shardCollection(serverDB + ".snapshots", { doc_id: 1 })
sh.shardCollection(serverDB + ".syncedseqs", { doc_id: 1 })

const serverDB = "test-yorkie-meta-server";
sh.enableSharding(serverDB);
sh.shardCollection(serverDB + ".clients", { project_id: 1 });
sh.shardCollection(serverDB + ".documents", { project_id: 1 });
sh.shardCollection(serverDB + ".changes", { doc_id: 1 });
sh.shardCollection(serverDB + ".snapshots", { doc_id: 1 });
sh.shardCollection(serverDB + ".syncedseqs", { doc_id: 1 });
sh.shardCollection(serverDB + ".versionvectors", { doc_id: 1 });
36 changes: 22 additions & 14 deletions design/mongodb-sharding.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
title: mongodb-sharding
target-version: 0.4.14
target-version: 0.5.7
---

# MongoDB Sharding
Expand Down Expand Up @@ -29,11 +29,10 @@ This document will only explain the core concepts of the selected sharding strat

1. Cluster-wide: `users`, `projects`
2. Project-wide: `documents`, `clients`
3. Document-wide: `changes`, `snapshots`, `syncedseqs`
3. Document-wide: `changes`, `snapshots`, `syncedseqs`, `versionvectors`

<img src="media/mongodb-sharding-prev-relation.png">


**Sharding Goals**

Shard Project-wide and Document-wide collections due to the large number of data count in each collection
Expand All @@ -49,6 +48,7 @@ Shard Project-wide and Document-wide collections due to the large number of data
3. `Changes`: `(doc_id, server_seq)`
4. `Snapshots`: `(doc_id, server_seq)`
5. `Syncedseqs`: `(doc_id, client_id)`
6. `Versionvectors`: `(doc_id, client_id)`

**Main Query Patterns**

Expand All @@ -64,6 +64,7 @@ cursor, err := c.collection(ColClients).Find(ctx, bson.M{
},
}, options.Find().SetLimit(int64(candidatesLimit)))
```

```go
// Documents
filter := bson.M{
Expand Down Expand Up @@ -106,6 +107,7 @@ cursor, err := c.collection(colChanges).Find(ctx, bson.M{
},
}, options.Find())
```

```go
// Snapshots
result := c.collection(colSnapshots).FindOne(ctx, bson.M{
Expand All @@ -132,6 +134,7 @@ Every unique constraint can be satisfied because each has the shard key as a pre
3. `Changes`: `(doc_id, server_seq)`
4. `Snapshots`: `(doc_id, server_seq)`
5. `Syncedseqs`: `(doc_id, client_id)`
6. `Versionvectors`: `(doc_id, client_id)`

**Changes of Reference Keys**

Expand All @@ -142,6 +145,7 @@ Since the uniqueness of `_id` isn't guaranteed across shards, reference keys to
3. `Changes`: `_id` -> `(project_id, doc_id, server_seq)`
4. `Snapshots`: `_id` -> `(project_id, doc_id, server_seq)`
5. `Syncedseqs`: `_id` -> `(project_id, doc_id, client_id)`
6. `Versionvectors`: `_id` -> `(project_id, doc_id, client_id)`

Considering that MongoDB ensures the uniqueness of `_id` per shard, `Documents` and `Clients` can be identified with the combination of `project_id` and `_id`. Note that the reference keys of document-wide collections are also subsequently changed.

Expand All @@ -155,12 +159,12 @@ Considering that MongoDB ensures the uniqueness of `_id` per shard, `Documents`

For a production deployment, consider the following to ensure data redundancy and system availability.

* Config Server (3 member replica set): `config1`,`config2`,`config3`
* 3 Shards (each a 3 member replica set):
* `shard1-1`,`shard1-2`, `shard1-3`
* `shard2-1`,`shard2-2`, `shard2-3`
* `shard3-1`,`shard3-2`, `shard3-3`
* 2 Mongos: `mongos1`, `mongos2`
- Config Server (3 member replica set): `config1`,`config2`,`config3`
- 3 Shards (each a 3 member replica set):
- `shard1-1`,`shard1-2`, `shard1-3`
- `shard2-1`,`shard2-2`, `shard2-3`
- `shard3-1`,`shard3-2`, `shard3-3`
- 2 Mongos: `mongos1`, `mongos2`

![Cluster architecture](media/mongodb-sharding-cluster-arch.png)

Expand All @@ -178,11 +182,12 @@ Using a composite shard key like `(project_id, key)` can resolve this issue. Aft
However, this change of shard keys can lead to the value duplication of either `actor_id` or `owner`, which uses `client_id` as a value. Now the values of `client_id` can be duplicated, contrary to the current sharding strategy where locating every client in the same project into the same shard prevents this kind of duplications.

The duplication of `client_id` values can devastate the consistency of documents. There are three expected approaches to resolve this issue:

1. Use `client_key + client_id` as a value.
* this may increase the size of CRDT metadata and the size of document snapshots.
- this may increase the size of CRDT metadata and the size of document snapshots.
2. Introduce a cluster-level GUID generator.
3. Depend on the low possibility of duplication of MongoDB ObjectID.
* see details in the following contents.
- see details in the following contents.

**Duplicate MongoDB ObjectID**

Expand All @@ -192,17 +197,20 @@ Both `client_id` and `doc_id` are currently using MongoDB ObjectID as a value. W

However, the possibility of duplicate ObjectIDs is **extremely low in practical use cases** due to its mechanism.
ObjectID uses the following format:

```
TimeStamp(4 bytes) + MachineId(3 bytes) + ProcessId(2 bytes) + Counter(3 bytes)
```

The condition for duplicate ObjectIDs is that more than `16,777,216` documents/clients are created every single second in a single machine and process. Considering Google processes over `99,000` searches every single second, it is unlikely to occur.

When we have to meet that amount of traffic in the future, consider the following options:

1. Introduce a cluster-level GUID generator.
2. Disable auto-balancing chunks of documents and clients.
* Just isolate each shard for a single project.
- Just isolate each shard for a single project.

## References
* [Implementation of ObjectID generator in golang driver](https://github.com/mongodb/mongo-go-driver/blob/v0.0.18/bson/objectid/objectid.go#L40)
* [Generating globally unique identifiers for use with MongoDB](https://www.mongodb.com/blog/post/generating-globally-unique-identifiers-for-use-with-mongodb)

- [Implementation of ObjectID generator in golang driver](https://github.com/mongodb/mongo-go-driver/blob/v0.0.18/bson/objectid/objectid.go#L40)
- [Generating globally unique identifiers for use with MongoDB](https://www.mongodb.com/blog/post/generating-globally-unique-identifiers-for-use-with-mongodb)
2 changes: 1 addition & 1 deletion server/backend/database/mongo/indexes.go
Original file line number Diff line number Diff line change
Expand Up @@ -174,8 +174,8 @@ var collectionInfos = []collectionInfo{
name: ColVersionVectors,
indexes: []mongo.IndexModel{{
Keys: bsonx.Doc{
{Key: "doc_id", Value: bsonx.Int32(1)}, // shard key
{Key: "project_id", Value: bsonx.Int32(1)},
{Key: "doc_id", Value: bsonx.Int32(1)},
{Key: "client_id", Value: bsonx.Int32(1)},
},
Options: options.Index().SetUnique(true),
Expand Down

0 comments on commit c9a86db

Please sign in to comment.