Utility methods to read text from input streams like zip files. Uses Apache Tika

mdcatapult/scala-source-reader

Source

SourceReader extracts text from a Source using a StreamReader. The main StreamReader implementation is TikaTextExtractingStreamReader, which delegates to Apache Tika.

Tika is not fully configurable in how it handles separated-value files, which leads to a bug involving the UTF-8 byte order mark (BOM), a leading character that carries no visible content. The BOM should simply be ignored, but Tika throws an exception, so the custom class SeparatedValueStreamReader is used for these files instead.
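To illustrate the BOM problem, here is a minimal sketch of one way to drop a leading UTF-8 BOM before reading a stream as text. It is not the actual SeparatedValueStreamReader implementation; the object name BomSkippingSketch and the methods skipUtf8Bom and readText are hypothetical, and the sketch assumes Java 9 or later for readNBytes.

```scala
import java.io.{InputStream, PushbackInputStream}
import java.util.Arrays
import scala.io.Source

// Hypothetical helper, not part of this library's API.
object BomSkippingSketch {

  // U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
  private val Utf8Bom: Array[Byte] = Array(0xEF, 0xBB, 0xBF).map(_.toByte)

  /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
  def skipUtf8Bom(in: InputStream): InputStream = {
    val pushback = new PushbackInputStream(in, Utf8Bom.length)
    val head = new Array[Byte](Utf8Bom.length)
    // Read up to three bytes; fewer are returned only at end of stream (Java 9+).
    val read = pushback.readNBytes(head, 0, head.length)
    val hasBom = read == Utf8Bom.length && Arrays.equals(head, Utf8Bom)
    // No BOM: push the bytes back so the caller sees the stream unchanged.
    if (!hasBom && read > 0) pushback.unread(head, 0, read)
    pushback
  }

  /** Reads the whole stream as UTF-8 text, ignoring a leading BOM. */
  def readText(in: InputStream): String = {
    val source = Source.fromInputStream(skipUtf8Bom(in), "UTF-8")
    try source.mkString
    finally source.close()
  }
}
```

With this sketch, a CSV stream that starts with EF BB BF and the same stream without those bytes yield identical text, so downstream parsing never sees the BOM.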

Testing

sbt +clean coverage +test coverageReport

Notes

The org.xerial sqlite-jdbc driver is not used by this library. It was previously included only to suppress a warning that Apache Tika needs it to process SQLite files. Since SQLite files are not processed here, the dependency has been removed; its line remains in build.sbt, commented out.

License

This project is licensed under the terms of the Apache 2 license, which can be found in the repository as LICENSE.txt
