Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds initial support for external attachments using the Filesystem or MinIO #1320

Open
wants to merge 59 commits into
base: main
Choose a base branch
from

Conversation

Spoffy
Copy link
Contributor

@Spoffy Spoffy commented Nov 24, 2024

This PR is still being worked on, but open to initial review.

Context

When attachments are uploaded to Grist, they're stored in an SQLite table within the document.

The result is that large files (multi-GB) substantially increase the size of the SQLite file, and cause performance problems (for example, with snapshots).

Proposed solution

The current design / full scope of changes can be seen in this Google document..

This PR contains a minimal proof of concept for external attachments, with the following features:

  • External attachment storage using either MinIO or the local filesystem.
  • A "default attachment store" setting for documents, which sets the destination for newly uploaded attachments.
  • Defaulting to existing behaviour. Without additional configuration, Grist's attachment storage behaviour is unchanged.
  • MinIO attachment storage is automatically made available if it's configured for document storage, but not automatically used by documents.

No UI exists for configuring anything related to attachment storage at this time. To use a store, the API or an SQLite client needs using to set the idOfDefaultAttachmentStore document setting to the ID of the desired attachment store. Attachment store IDs are logged to the console.

Currently only a single store of each backend type (minio, filesystem) can exist, as they're automatically configured using the existing setting.

Related issues

#1021

Review guidance / code overview

Three significant components are added in this PR, with the rest of the PR being generally glue around them:

  • AttachmentFileManager (in "AttachmentFileManager.ts") handles attachment fetching / saving logic on behalf of ActiveDoc
  • An IAttachmentStore is an instantiated store, that handles saving / loading attachments from a specific provider.
  • AttachmentStoreProvider is used to resolve store IDs to stores, and is the main way of providing other components (e.g AttachmentFileManager) with access to store instances.

The available types of attachment store are defined in coreCreator.ts, under the name "attachmentStoreBackends".

Additionally:

  • A few existing classes have been extended with streaming methods, to simplify interaction with the existing attachment code and to facilitate large files in the future without keeping copies in memory.
  • Tests may not pass yet, and additional unit tests are WIP.

Migrations / schema changes

  • One migration to _gristsys_files has been added, which adds a column to track which store a document is in.
  • idOfDefaultAttachmentStore has been added to the DocumentSettings JSON blob.

Both of these should be non-breaking on older installations, as both should be ignored.

Has this been tested?

Behaviour has been manually tested in the following ways:

  • Filesystem storage
  • MinIO client with MinIO storage
  • MinIO client with S3 storage

There are currently no unit tests. These should be added before merging, but the goal.

Recommended local testing setup

MinIO

  1. Set up Grist to use MinIO normally.
  2. Start Grist
  3. Retrieve the attachment store id from the console.
  4. Download the Grist document.
  5. Open the Grist document in a database client, and add "idOfDefaultAttachmentStore": "store id" to _grist_DocInfo -> documentSettings
  6. Upload the Grist document
  7. Try adding an attachment to the new document. The name should show up in MinIO's storage path on the local filesystem.

Known issues

  • When using S3 storage, attachments showed as broken until refreshing the page.

Copy link
Member

@paulfitz paulfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a batch of superficial comments from a skim. Feels like it is coming along pretty well!

app/common/DocumentSettings.ts Outdated Show resolved Hide resolved
app/server/lib/ActiveDoc.ts Outdated Show resolved Hide resolved
@@ -2822,17 +2841,21 @@ export class ActiveDoc extends EventEmitter {
return this._dataEngine;
}

private async _getDocumentSettings(): Promise<DocumentSettings | undefined> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are no settings, should that be an error? In what circumstances might that happen?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see the method gets used in a migration. That answers that. Might still make sense to throw an error, but accept that error as a non-erroneous possibility in the migration. Feels like in all other circs it should be an error? No big deal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can switch this to throwing an error, then wrap the code below in _makeEngine in a try-catch block to handle this case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that'd be fine

app/server/lib/AttachmentFileManager.ts Outdated Show resolved Hide resolved
app/server/lib/AttachmentFileManager.ts Outdated Show resolved Hide resolved
* Gets the correct pool id for a given document, given the document's id and trunk id.
* Avoids other areas of the codebase having to understand how documents are mapped to pools.
* This is a key security point - documents which share a pool can access each others' attachments.
* Therefore documents with different security domains (e.g from different teams) need to be sure not to share
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good now to talk about our strategy, for snapshots and forks to have the same pool id and share attachments?

* This is a general-purpose interface that should abstract over many different storage providers,
* so shouldn't have methods which rely on one the features of one specific provider.
*
* This is distinct from `ExternalStorage`, in that it's more generic and doesn't support versioning.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm "more generic" feels wrong here, since an interface mentioning docPoolIds: DocPoolId vs key: string feels much more specific in some ways. Not many things in the world have DocPoolIds. Maybe drop that text?

export class MinIOAttachmentStore implements IAttachmentStore {
constructor(
public id: string,
private _storage: MinIOExternalStorage,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the reason this is using MinIOExternalStorage specifically and won't work with S3ExternalStorage or AzureExternalStorage is new Stream methods, correct? It would still help I think to anticipate that developers will add those methods for those types of storage. It is totally fine just to implement for MinIO here, since this is core, but ideally the follow up work needed would be adding the methods and making small updates, and not doing a significant refactor. Trying to bear in mind for reminder of review to watch for specific dependencies on MinIO creeping in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read rest of PR. Feels fine, no MinIO creep 👍

Copy link
Contributor Author

@Spoffy Spoffy Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% - I can't remember if I left the "Make an interface for streamable external storage" in a TODO or not somewhere, but it's planned, so we can turn "MinIOAttachmentStore" into "ExternalStorageAttachmentStore"

Copy link
Contributor Author

@Spoffy Spoffy Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done now - any external storage should be usable, if wrapped in ExternalStorageAttachmentStore, and made to implement the streaming interface

app/server/lib/AttachmentStoreProvider.ts Outdated Show resolved Hide resolved
app/server/lib/AttachmentStoreProvider.ts Show resolved Hide resolved
}

public async getFileData(fileIdent: string): Promise<Buffer | null> {
this._log.debug({ fileIdent }, "retrieving file data");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned this logging might be a bit much. When a document is opened, it loads attachment previews. That results in 1 or 2 log lines for every attachment in a document.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might indeed be a bit much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed 1 of the log lines - the other one is quite useful for debugging as it mentions the attachment pool, and which storage (document / external) the file is going into.

@Spoffy Spoffy marked this pull request as ready for review December 3, 2024 18:47
@jordigh jordigh self-assigned this Dec 3, 2024
Copy link
Member

@paulfitz paulfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is copying of documents with external attachments addressed? Might have missed it. If not, would want to be sure we don't miss that in follow-up.

@@ -636,6 +648,7 @@ export class ActiveDoc extends EventEmitter {
useExisting?: boolean, // If set, document can be created as an overlay on
// an existing sqlite file.
}): Promise<ActiveDoc> {
console.log(`OPENOPENOPENOPENOPENOPENOPENOPEN ${this.docName} ANYTHING ELSE`);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stray debugging message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, this and others. Thanks for catching this 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan on removing it last minute? Just checking why it is still present.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, likely an oversight on my part when doing the other bits of your feedback. Glad this wasn't ticked off at least!

app/server/lib/ActiveDoc.ts Outdated Show resolved Hide resolved
app/server/lib/AttachmentFileManager.ts Outdated Show resolved Hide resolved
app/server/lib/AttachmentFileManager.ts Show resolved Hide resolved
app/server/lib/AttachmentStore.ts Outdated Show resolved Hide resolved
app/server/lib/AttachmentStore.ts Show resolved Hide resolved
app/server/lib/AttachmentStoreProvider.ts Show resolved Hide resolved
Copy link
Contributor

@jordigh jordigh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a fairly complex setup, but given the problem, I think you have convinced me that all of this complexity is required. I could not find many opportunities for simplification.

// Compatible with Document entity for ease of use
export interface AttachmentStoreDocInfo {
id: string;
// This should NOT be an optional: the programmer should make an explicit choice for this value to prevent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So why make it optional if it shouldn't be?

Copy link
Contributor Author

@Spoffy Spoffy Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's inherently a nullable value - a document might not have trunkId. But it's not "optional" in the sense of typescript's "optional".

If I make it an optional (i.e typescripts ? syntax), code like this won't error:

getDocPoolIdFromDocInfo({ id: _someDoc.id })

The above is incorrect - and would result in attachments ending up in the wrong place (the fork's attachment file pool, instead of the original document's file pool).

This is correct:

getDocPoolIdFromDocInfo({ id: _someDoc.id, trunkId: _someDoc.trunkId })

So by omitting ? and using string | null | undefined instead, it's hopefully catching a class of errors early.

That's what this comment is trying (and failing?) to explain. :D

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, boy, "optional" vs "an optional", subtle reading thing there.

Maybe rephrase, "We explicitly make this a union type instead of making the attribute optional because the programmer must make a conscious choice to mark it as null or undefined, not merely omit it."

app/server/lib/AttachmentStore.ts Show resolved Hide resolved
/**
* Gets the correct pool id for a given document, given the document's id and trunk id.
*
* Attachments are stored in a "Document Pool", which is used to manage the attachments' lifecycle.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have silly niggle: can we wrap these comments at fewer columns? They're already awkward to read as github diffs:

image

Looks like you picked 120 columns. How about 100 columns instead? (or my favourite: 80)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

120 is what we've got configured in .eslint.js. Happy to go lower personally, but is it worth discussing at a Grist level? We could lower the max length for comments only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't be much of a problem beyond diffs. If you can easily warp comments at a lower character count, nice. If you can't, don't worry about it and let's move on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated IDE with settings for this for the future, probably a bit too big to go through and update all comments right now :)

private _storage: StreamingExternalStorage,
private _prefixParts: string[]
) {
if (_storage.removeAllWithPrefix == undefined) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really want to allow the possibility of null == undefined in this comparison?

Copy link
Contributor Author

@Spoffy Spoffy Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! Honestly, a simple !_storage.removeAllWithPrefix would do. removeAllWithPrefix needs to be a valid function for this to work.

Would !_storage.removeAllWithPrefix be clearer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. I kind of consider the == operator to be "legacy". At least checking for falsey is clearer in intent.

app/server/lib/AttachmentStore.ts Show resolved Hide resolved
app/server/lib/MinIOExternalStorage.ts Outdated Show resolved Hide resolved
test/server/lib/ActiveDoc.ts Outdated Show resolved Hide resolved
app/server/utils/MemoryWritableStream.ts Show resolved Hide resolved
app/server/lib/coreCreator.ts Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants