Source

SourceReader extracts text from a Source using a StreamReader. The main StreamReader is TikaTextExtractingStreamReader, which uses Apache Tika.

Tika isn't fully configurable in how it handles separated value files, which leads to a bug involving the UTF-8 byte order mark (BOM), a character at the start of some files that represents no visible character. We want to ignore the BOM, but Tika throws an exception, so the custom class SeparatedValueStreamReader is used for these files instead.
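As an illustration of the idea of ignoring a leading BOM, here is a minimal sketch using only the JDK (Java 9 or later). It is not the SeparatedValueStreamReader implementation; the object name BomSkipping and both of its methods are hypothetical and exist only for this example.

```scala
import java.io.{InputStream, PushbackInputStream}
import java.nio.charset.StandardCharsets

// Hypothetical helper, not part of this library.
object BomSkipping {
  // UTF-8 encoding of U+FEFF, the byte order mark.
  private val Utf8Bom: Array[Byte] = Array(0xEF.toByte, 0xBB.toByte, 0xBF.toByte)

  /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
  def skipUtf8Bom(in: InputStream): InputStream = {
    val pushback = new PushbackInputStream(in, Utf8Bom.length)
    val head = new Array[Byte](Utf8Bom.length)
    // readNBytes keeps reading until the buffer is full or the stream ends.
    val read = pushback.readNBytes(head, 0, head.length)
    val hasBom = read == Utf8Bom.length && java.util.Arrays.equals(head, Utf8Bom)
    // No BOM: push the bytes back so the caller sees the stream unchanged.
    if (!hasBom && read > 0) pushback.unread(head, 0, read)
    pushback
  }

  /** Reads the whole stream as UTF-8 text, without any leading BOM. */
  def readAll(in: InputStream): String =
    new String(skipUtf8Bom(in).readAllBytes(), StandardCharsets.UTF_8)
}
```

Wrapping the stream rather than editing the decoded text keeps the check to the first three bytes, so the rest of the file is passed through untouched.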

Testing

sbt +clean coverage +test coverageReport

Notes

The org.xerial sqlite-jdbc driver isn't actually used. It was previously included to suppress a warning that Apache Tika needs it to process SQLite files. We do not process SQLite files, so the dependency is no longer active: it has been left in build.sbt but commented out.

License

This project is licensed under the terms of the Apache 2 license, which can be found in the repository as LICENSE.txt.

mdcatapult/scala-source-reader
