This guide helps you create your own custom dialect.
First of all, check if you really need a dialect. A dialect implements support for an external file provider like BucketFS or S3. A dialect does not implement support for different file types (for those, see the user guide).
If your data source is not a file storage but, for example, a document database, you probably want to create a dialect of virtual-schema-common-document instead.
For reference, take a look at the existing dialects, for example the one for BucketFS.
Start by creating a Maven project, defined by a `pom.xml` file. There, add this generic base as a dependency:
```xml
<dependency>
    <groupId>com.exasol</groupId>
    <artifactId>virtual-schema-common-document-files</artifactId>
    <version>VERSION</version>
</dependency>
```
Don't forget to replace `VERSION` with the latest release number of this repository.
The best point to start your implementation is a class named `<YOUR_FILE_SOURCE>FileLoader` that implements the `FileLoader` interface. `FileLoader`s fetch the files from your data source and return them as `InputStream`s. The `FileLoader` interface defines only the method `loadFiles()`. This method loads the files for a query and returns them as `InputStreamWithResourceName`s (`InputStreamWithResourceName` is a class that bundles an input stream with the file name used for error messages).
You may have noticed that the `loadFiles()` method does not take any arguments. The reason is that the arguments (which files to load) are passed to the `FileLoaderFactory` instead. That way you can implement different `FileLoader`s and a `FileLoaderFactory` that produces them depending on the request. Typically, however, you will implement a `FileLoaderFactory` that simply passes the parameters on to the constructor of your `FileLoader`.
Mind the following aspects when implementing the `FileLoader`:
Your adapter must support searching for wildcard patterns on the remote file system. If your file system does not (fully) support this, filter as much as possible using the filters the remote system supports (for example a prefix) and apply an exact filter in your `FileLoader`. The generic adapter passes the search pattern as a `StringFilter` to your `FileLoaderFactory`. The `StringFilter` class offers helper functions like `getStaticPrefix()` or `Matcher`s. A `Matcher` checks whether a string fulfills the pattern or not.
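The two-step filtering described above can be illustrated in plain Java. This is not the real `StringFilter`/`Matcher` API, just a sketch of the idea: coarse prefix filtering (what a remote system like S3 can do server-side) followed by exact matching; only the `*` wildcard is handled here.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Sketch of wildcard filtering: coarse prefix filter first, exact match second.
class WildcardFiltering {
    // Static prefix of a glob: everything before the first wildcard.
    // This is the part a remote system can typically filter on directly.
    static String staticPrefix(final String glob) {
        final int wildcard = glob.indexOf('*');
        return wildcard < 0 ? glob : glob.substring(0, wildcard);
    }

    static List<String> filter(final List<String> remoteFiles, final String glob) {
        // Exact matching, approximated here by translating '*' into a regex.
        final Pattern exact = Pattern.compile(Pattern.quote(glob).replace("*", "\\E.*\\Q"));
        final String prefix = staticPrefix(glob);
        return remoteFiles.stream()
                .filter(name -> name.startsWith(prefix)) // step 1: coarse, remote-side
                .filter(name -> exact.matcher(name).matches()) // step 2: exact, local
                .collect(Collectors.toList());
    }
}
```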
The Virtual Schema adapter distributes the query processing over multiple parallel-running workers (UDFs). Therefore it has to split the data into multiple segments. It does this by passing a `SegmentDescription` to your `FileLoaderFactory`. A `SegmentDescription` consists of two numbers:

- the total number of segments
- a segment id

Your adapter has to divide the input files into the specified number of segments and only return the contents of the specified segment. The generic adapter runs the file loader multiple times in the parallel workers, each time with a different segment id.
Luckily you don't have to implement the segmentation yourself. You can use the `SegmentMatcher` for that purpose.
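The segmentation contract can be illustrated with plain Java. This is not the real `SegmentMatcher`, just a round-robin sketch of the idea: each worker returns only its share of the files, and all segments together cover every file exactly once.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Sketch of segmentation: divide the files by segment count and segment id.
class Segmentation {
    static List<String> filesForSegment(final List<String> allFiles,
            final int segmentCount, final int segmentId) {
        // Round-robin assignment by index: file i belongs to segment i % segmentCount.
        return IntStream.range(0, allFiles.size())
                .filter(index -> index % segmentCount == segmentId)
                .mapToObj(allFiles::get)
                .collect(Collectors.toList());
    }
}
```

Together the segments partition the input, so no file is read twice and none is skipped.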
Now create the adapter definition. Create a new class `<YOUR_FILE_SOURCE>DocumentFilesAdapter` that extends `DocumentFilesAdapter`. In that class define a constant `ADAPTER_NAME`:
```java
public static final String ADAPTER_NAME = "<YOUR_FILE_SOURCE>_DOCUMENT_FILES";
```
In addition, implement the methods `getFileLoaderFactory()` and `getAdapterName()`, for example:
```java
@Override
protected FileLoaderFactory getFileLoaderFactory() {
    return new BucketFsFileLoaderFactory();
}

@Override
protected String getAdapterName() {
    return ADAPTER_NAME;
}
```
Finally, create a class named `<YOUR_FILE_SOURCE>DocumentFilesAdapterFactory` implementing `AdapterFactory`. This class is the entry point of your Virtual Schema. It is loaded via a service loader and builds the Virtual Schema adapter. To register it, create the file `src/main/resources/META-INF/services/com.exasol.adapter.AdapterFactory` with the fully qualified name of your `AdapterFactory` as content:

```
com.example.<YOUR_FILE_SOURCE>DocumentFilesAdapterFactory
```
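This file is what makes the discovery work: the JDK's `ServiceLoader` only finds implementations that are listed in a matching `META-INF/services` file on the classpath. A small self-contained illustration, using a hypothetical local interface rather than the real `com.exasol.adapter.AdapterFactory`:

```java
import java.util.ServiceLoader;

class AdapterFactoryDemo {
    // Hypothetical stand-in for com.exasol.adapter.AdapterFactory.
    interface Factory {
        String getAdapterName();
    }

    static long discoveredFactories() {
        // ServiceLoader scans the classpath for
        // META-INF/services/<fully-qualified-interface-name> entries.
        // This sketch ships no such file, so no provider is found.
        return ServiceLoader.load(Factory.class).stream().count();
    }
}
```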
Don't forget to test your dialect. Take a look at the tests of the BucketFS dialect as an example; it includes both unit and integration tests. The integration tests use exasol-testcontainers.
If you need help, feel free to create a GitHub issue in this repository.
When you have finished your dialect, we can list it in the README of this repository so that others can find it. Just open an issue!