Source

SourceReader extracts text from a Source using a StreamReader. The main StreamReader is TikaTextExtractingStreamReader, which uses Apache Tika.

Tika is not fully configurable in how it handles separated value files, which leads to a bug involving the UTF-8 byte order mark (BOM), a character with no visible representation. We want to ignore this character, but Tika throws an exception, so the custom class SeparatedValueStreamReader is used instead.
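As an illustration of the idea only, and not the actual SeparatedValueStreamReader code, a leading UTF-8 BOM can be dropped before the text is decoded. The object and method names below (BomSkippingExample, skipUtf8Bom, readText) are hypothetical and do not come from this repository; the sketch assumes the input is UTF-8 and uses only the JDK and the Scala standard library.

```scala
import java.io.{ByteArrayInputStream, InputStream, PushbackInputStream}
import java.nio.charset.StandardCharsets
import scala.io.Source

// Hypothetical helper, not part of this project: shows one way to ignore a UTF-8 BOM.
object BomSkippingExample {
  // The UTF-8 encoding of U+FEFF (the byte order mark) is the byte sequence EF BB BF.
  private val Utf8Bom: Array[Byte] = Array(0xEF, 0xBB, 0xBF).map(_.toByte)

  /** Returns a stream positioned just after a leading UTF-8 BOM, if one is present. */
  def skipUtf8Bom(in: InputStream): InputStream = {
    val pushback = new PushbackInputStream(in, Utf8Bom.length)
    val head = new Array[Byte](Utf8Bom.length)

    // Read up to three bytes; a single read call may return fewer than requested.
    var read = 0
    var n = 0
    while (read < head.length && n != -1) {
      n = pushback.read(head, read, head.length - read)
      if (n > 0) read += n
    }

    val isBom = read == Utf8Bom.length && java.util.Arrays.equals(head, Utf8Bom)
    // Not a BOM: push the bytes back so the caller sees the stream unchanged.
    if (!isBom && read > 0) pushback.unread(head, 0, read)
    pushback
  }

  /** Decodes the whole stream as UTF-8 text, ignoring a leading BOM. */
  def readText(in: InputStream): String =
    Source.fromInputStream(skipUtf8Bom(in), StandardCharsets.UTF_8.name).mkString
}
```

For example, calling BomSkippingExample.readText on a stream whose bytes are EF BB BF followed by the UTF-8 bytes of a CSV header yields the header text without the BOM, while a stream without a BOM is returned unchanged.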

Testing

sbt +clean coverage +test coverageReport

Notes

The org.xerial sqlite-jdbc driver is not actually used. It was previously included only to suppress a warning that Apache Tika needs it to process SQLite files. Because this project does not process SQLite files, the dependency has been removed; its entry remains in build.sbt, commented out.

License

This project is licensed under the terms of the Apache License 2.0, which can be found in the repository as LICENSE.txt.