Summarize long texts by combining graph-algorithmic approaches with distributional word vector models.
Work in progress.
Some code was taken from another of my repositories.
- Extract the important words of each sentence with the WordRank algorithm:
  - vectorize the words with word2vec,
  - build an adjacency matrix on top of the vectors (pairwise distances between the words of a sentence),
  - turn the adjacency matrix into a weighted graph,
  - find the largest clique of that graph,
  - the members of that clique are taken as the most "important" words in the graph-theoretic sense (a minimal sketch follows this list).
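
A minimal sketch of this step, assuming cosine similarity between gensim word2vec vectors as the adjacency-matrix entries and networkx for the clique search. The function name `keywords_via_clique`, the `threshold` parameter, and the choice of similarity measure are illustrative assumptions, not the repository's actual API:

```python
# Sketch only: graph construction details are assumptions, not this repo's code.
import itertools

import networkx as nx
import numpy as np
from gensim.models import Word2Vec  # a pre-trained model is assumed


def keywords_via_clique(tokens, model, threshold=0.5):
    """word2vec vectors -> adjacency matrix -> weighted graph -> max clique."""
    words = [w for w in dict.fromkeys(tokens) if w in model.wv]  # unique, in-vocab
    n = len(words)
    # Adjacency matrix: pairwise similarity between word vectors.
    adj = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        adj[i, j] = adj[j, i] = model.wv.similarity(words[i], words[j])
    # Weighted graph: keep only sufficiently similar word pairs as edges.
    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    for i, j in itertools.combinations(range(n), 2):
        if adj[i, j] >= threshold:
            graph.add_edge(i, j, weight=float(adj[i, j]))
    # The largest clique: a set of words that are all mutually related.
    cliques = list(nx.find_cliques(graph))  # enumerates maximal cliques
    best = max(cliques, key=len) if cliques else []
    return [words[i] for i in best]
```

Note that maximum-clique search is NP-hard in general, but sentence-sized word graphs are small enough for `nx.find_cliques` to enumerate in practice.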
- Extract the important sentences of the text with the TextRank algorithm; the function build_similarity_matrix is the entry point:
  - it compares sentences pairwise and derives a similarity metric from simple token overlap,
  - the sentences with the highest aggregate similarity are then taken as the most "informative" ones (see the sketch below).
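
A minimal sketch of this step. `build_similarity_matrix` is the entry point named above; the Jaccard token overlap and the PageRank ranking over the similarity graph are my assumptions about the details (PageRank over a sentence graph is the standard TextRank formulation):

```python
# Sketch only: the similarity function and ranking details are assumptions.
import networkx as nx
import numpy as np


def sentence_similarity(s1, s2):
    """Similarity as simple token overlap (Jaccard) between two sentences."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 or not t2:
        return 0.0
    return len(t1 & t2) / len(t1 | t2)


def build_similarity_matrix(sentences):
    """Pairwise sentence similarities; the input to the TextRank ranking."""
    n = len(sentences)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = sentence_similarity(sentences[i], sentences[j])
    return sim


def top_sentences(sentences, k=3):
    """Rank sentences by PageRank over the weighted similarity graph."""
    graph = nx.from_numpy_array(build_similarity_matrix(sentences))
    scores = nx.pagerank(graph)                    # TextRank scores
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]  # keep original order
```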
- If necessary, you can run the whole algorithm over the text several times (useful if you want to compress the text further); the function generate_summary_loop describes the full pipeline and is sketched below.
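
generate_summary_loop is the function named above; its body below, the `passes`/`ratio` parameters, and the naive sentence splitting are illustrative assumptions, reusing `top_sentences` from the previous sketch:

```python
# Sketch only: parameters and sentence splitting are assumptions.
def generate_summary_loop(text, passes=2, ratio=0.5):
    """Run the summarizer several times; each pass keeps `ratio` of sentences."""
    for _ in range(passes):
        # Naive split on '.'; a real pipeline would use a proper tokenizer.
        sentences = [s.strip() for s in text.split('.') if s.strip()]
        if len(sentences) <= 1:
            break  # nothing left to compress
        k = max(1, int(len(sentences) * ratio))
        text = '. '.join(top_sentences(sentences, k=k)) + '.'
    return text
```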