-
Notifications
You must be signed in to change notification settings - Fork 215
Home new
Welcome to the elasticsearch-river-mongodb wiki!
This river uses MongoDB as datasource. It support attachment (GridFS). The main branch support ElasticSearch 0.90.2 and MongoDB 2.4.5
The current implementation monitors oplog collection from the local database. Make sure to enable replica set [^1]. It does not support master / slave replication. Monitoring is done using tailable cursor [^2]. All operations are supported: insert, update, delete.
Mongo document is stored within ElasticSearch without transformation. The new document stored in ElasticSearch will use the same id as mongoDB.
GridFS Mongo document (with large binary content) requires transformation before being stored in ElasticSearch. It requires mapper-attachment plugin [^3] installed in ElasticSearch. A specific mapping is specified to support GridFS Mongo document in ElasticSearch:
{
"testindex" : {
"files" : {
"properties" : {
"content" : {
"path" : "full",
"type" : "attachment",
"fields" : {
"content" : { "type": "string" },
"author" : { "type": "string" },
"title" : { "type": "string" },
"keywords" : { "type" : "string" },
"date" : {
"format": "dateOptionalTime",
"type": "date"
},
"content_type": { "type" : "string" }
}
},
"chunkSize": { "type" : "long" },
"md5" : { "type" : "string" },
"length" : { "type" : "long" },
"filename" : { "type" : "string" },
"contentType" : { "type" : "string" },
"uploadDate" : {
"format" : "dateOptionalTime",
"type" : "date"
},
"metadata" : { "type" : "object" }
}
}
}
}
- content.content is the base 64 encoded binary content.
- content.title is GridFS filename property.
- content.content_type is GridFS content type property.
- chunkSize, md5, length, filename, contentType and uploadDate map the same property from GridFS MongoDB document.
- metadata (optional) map metadata properties from GridFS MongoDB document.
The plugin has a dependency to elasticsearch-mapper-attachment
- In order to install this plugin, simply run:
$ES_HOME\bin\plugin.bat -install elasticsearch/elasticsearch-mapper-attachments/1.7.0
-
Simply run:
$ES_HOME\bin\plugin.bat -install richardwilly98/elasticsearch-river-mongodb/1.6.5
-
Since ES 0.20.2 the plugins cannot be deployed from the command line above. Please execute as temporary workaround:
$ES_HOME\bin\plugin.bat -url https://github.com/downloads/richardwilly98/elasticsearch-river-mongodb/elasticsearch-river-mongodb-1.6.5.zip -install river-mongodb
Release 1.6.5 is the last release available from https://github.com/downloads/richardwilly98/elasticsearch-river-mongodb
- The river is also available in [Maven Central Repository]:
-url https://oss.sonatype.org/content/repositories/releases/com/github/richardwilly98/elasticsearch/elasticsearch-river-mongodb/1.6.11/elasticsearch-river-mongodb-1.6.11.zip -install river-mongodb```
- Restart ElasticSearch.
#### Manually
- Create path $ES_HOME\plugins\river-mongodb\
Copy files: mongo-java-driver-2.11.2.jar and elasticsearch-river-mongodb-1.6.11.jar in $ES_HOME\plugins\river-mongodb.
- Restart ElasticSearch.
### Configuration
Create a new river for each MongoDB collection that should be indexed by
ElasticSearch.
- Replace \${es.river.name}, \${mongo.db.name},
\${mongo.collection.name}, \${mongo.is.gridfs.collection},
\${es.index.name} and \${es.type.name} by the correct values.
Parameters servers, options and credentials are optional.
- \${es.river.name} is the Elasticsearch river name
- servers is an array of mongo instances. If not specify the
\${mongo.instance1.host} (default value: localhost) and
\${mongo.instance1.port} (default value: 27017)
- options define additional river options.
- Mongo options settings used by the driver (only
secondary_read_preference is implemented).
- drop_collection can be used to remove all document
associated with the index type when the collection is
dropper from MongoDB.
- exclude_fields this option will remove unwanted fields from
MongoDB before documents are indexed in ES. See example
[^4].
- include_collection this option will include the collection
name in the document indexed \${mongo.include.collection} is
the attribute name. See example [^5].
- credentials is an array of the credential required by the
databases. db can be 'admin' or 'local'. The credentials are
used to connect to local and \${mongo.db.name}. See example
[^6].
- In index a throttle size can be defined \${es.throttle.size}
default value is 500.
- In mongo a custom filter can be added in \${mongo.filter}. For
more details see [^7].
- From shell execute the command:
\$ curl -XPUT "localhost:9200/_river/\${es.river.name}/_meta" -d '\
{\
[type]() "mongodb",\
[mongodb]() { \
[servers]()\
[\
{ [host]() \${mongo.instance1.host}, [port]() \${mongo.instance1.port}
},\
{ [host]() \${mongo.instance2.host}, [port]() \${mongo.instance2.port}
}\
],\
[options]() { \
"secondary\_read\_preference" : true, \
[drop_collection]() \${mongo.drop.collection}, \
[exclude_fields]() \${mongo.exclude.field},\
[include_collection]() \${mongo.include.collection}\
},\
[credentials]()\
[\
{ [db]() "local", [user]() \${mongo.local.user}, [password]()
\${mongo.local.password} },\
{ [db]() "admin", [user]() \${mongo.db.user}, [password]()
\${mongo.db.password} }\
],\
[db]() \${mongo.db.name}, \
[collection]() \${mongo.collection.name}, \
[gridfs]() \${mongo.is.gridfs.collection},\
[filter]() \${mongo.filter}\
}, \
[index]() { \
[name]() \${es.index.name}, \
[throttle_size]() \${es.throttle.size},\
[type]() \${es.type.name}\
}\
}'
- Example:
\$ curl -XPUT "localhost:9200/_river/mongogridfs/_meta" -d'\
{\
[type]() "mongodb",\
[mongodb]() {\
[db]() "testmongo", \
[collection]() "files", \
[gridfs]() true\
},\
[index]() {\
[name]() "testmongo", \
[type]() "files"\
}\
}'
### Validation
- Import a PDF file using mongofiles utility.\
$MONGO_HOME\bin\mongofiles.exe —host localhost:27017 —db testmongo —collection files put test-large-document.pdf
connected to: localhost:27017\
added file: \
{\
\_id: ObjectId('4f244b4528a039f8f1178fdd'), \
filename: "test-large-document.pdf", \
chunkSize: 262144, \
uploadDate: new Date(1327778630447), \
md5: "8ae3c6998db4ebbaf69421464e0c3ff9", \
length: 50255626 \
}\
done!
- Retrieve the indexed document by the id:
$ curl -XGET "localhost:9200/testmongo/files/4f244b4528a039f8f1178fdd?pretty=true"
\
\* The imported PDF contains the word 'Reference'.\
$ curl -XGET "localhost:9200/testmongo/files/\_search?q=Reference&pretty=true"
### Troubleshooting
- Add logging in $ES_HOME\config\logging.yml\
logger:
river.mongodb: TRACE
Then restart ES.\
Please post ES log file in [Github issues](https://github.com/richardwilly98/elasticsearch-river-mongodb/issues)
Features
--------
### Sharded collection
The plugin should point to one of more mongos instance. It will discover
automatically the shards available (looking at config.shards
collection).
It will create one thread monitoring oplog.rs for each shard.
### Script filters
This feature has been tested with
[lang-javascript](https://github.com/elasticsearch/elasticsearch-lang-javascript)
and
[lang-groovy](https://github.com/elasticsearch/elasticsearch-lang-groovy)
plugins.\
\* In order to install the plugins, simply run: \
$ES\_HOME\bin\plugin.bat -install elasticsearch/elasticsearch-lang-javascript/1.4.0
$ES\_HOME\bin\plugin.bat -install elasticsearch/elasticsearch-lang-groovy/1.4.0
New attributes "scriptType" and "script" should be added to "mongodb"
attribute. "scriptType" is optional with javascript plugin. It should be
set to groovy with groovy plugin.
- Example:
Assuming the document in MongoDB has an attribute "title_from_mongo"
and this attribute should mapped to the attribute "title".
$ curl -XPUT "localhost:9200/_river/mongoscriptfilter/_meta" -d'\
{\
[type]() "mongodb",\
[mongodb]() {\
[db]() "testmongo", \
[collection]() "documents", \
[script]() "ctx.document.title = ctx.document.title_from_mongo;
delete ctx.document.title_from_mongo;"\
},\
[index]() {\
[name]() "testmongo", \
[type]() "documents"\
}\
}'
Assuming the document in MongoDB has an attribute "state" if it's value
is 'CLOSED' the document should be deleted from ES index.
$ curl -XPUT "localhost:9200/_river/mongoscriptfilter/_meta" -d'\
{\
[type]() "mongodb",\
[mongodb]() {\
[db]() "testmongo", \
[collection]() "documents", \
[script]() "if( ctx.document.state == 'CLOSED' ) { ctx.deleted = true;
}"\
},\
[index]() {\
[name]() "testmongo", \
[type]() "documents"\
}\
}'
Assuming the document in MongoDB has an attribute "scorce". Only
documents with score \> 100 should be indexed.
$ curl -XPUT "localhost:9200/\_river/mongoscriptfilter/\_meta"-d '\
{\
[type]() "mongodb",\
[mongodb]() {\
[db]() "testmongo", \
[collection]() "documents", \
[script]() "if( ctx.document.score \> 100 ) { ctx.ignore = true; }"\
},\
[index]() {\
[name]() "testmongo", \
[type]() "documents"\
}\
}'
Examples for Groovy plugin are available here [^8]
Resources
---------
- Much more details about oplog collection:
[http://www.snailinaturtleneck.com/blog/2010/10/12/replication-internals/](http://www.snailinaturtleneck.com/blog/2010/10/12/replication-internals/)
[^1]: [http://www.mongodb.org/display/DOCS/Replica+Set+Tutorial](http://www.mongodb.org/display/DOCS/Replica+Set+Tutorial)
[^2]: [http://www.mongodb.org/display/DOCS/Tailable+Cursors](http://www.mongodb.org/display/DOCS/Tailable+Cursors)
[^3]: [http://www.elasticsearch.org/guide/reference/mapping/attachment-type.html](http://www.elasticsearch.org/guide/reference/mapping/attachment-type.html)
[^4]: [Example river settings with exclude
fields](https://github.com/richardwilly98/elasticsearch-river-mongodb/blob/master/resources/issues/76/mongodb-river-simple.json)
[^5]: [Example river settings with include
collection](https://github.com/richardwilly98/elasticsearch-river-mongodb/tree/master/resources/issues/101)
[^6]: [Example river settings with replica set and database
authentication](https://github.com/richardwilly98/elasticsearch-river-mongodb/blob/master/resources/river-settings-with-replicaset-and-authentication.txt)
[^7]: [MongoDB custom
filter](https://github.com/richardwilly98/elasticsearch-river-mongodb/wiki/MongoDB-custom-filter)
[^8]: [Example river settings using Groovy
script](https://github.com/richardwilly98/elasticsearch-river-mongodb/tree/master/resources/issues/87)