Text analytics

Anže Sršen edited this page Dec 11, 2017 · 1 revision

The EventRegistry package also includes an Analytics class for performing various text analytics tasks. The class will be extended with additional functionality, but for now it allows you to:

  • semantically annotate a document with the entities and non-entities mentioned in it,
  • categorize the document into a list of predefined categories based on the DMOZ.org taxonomy,
  • compute the sentiment of the document,
  • determine the language of the document.

To test the different methods visually, please visit our demo pages.

Available methods

Semantic annotation

To semantically annotate a given document, use code such as:

import {EventRegistry, Analytics} from "eventregistry";
const er = new EventRegistry();
const analytics = new Analytics(er);
analytics.annotate("Microsoft released a new version of Windows OS.").then((ann) => {
    console.info(ann);
});

Text categorization

Categorization is currently supported only for English. To categorize a document into a predefined set of categories and identify the top related keywords, use code such as:

import {EventRegistry, Analytics} from "eventregistry";
const er = new EventRegistry();
const analytics = new Analytics(er);
analytics.categorize("Microsoft released a new version of Windows OS.").then((cat) => {
    console.info(cat);
});

Sentiment detection

Here is sample code to detect the sentiment expressed in the document:

import {EventRegistry, Analytics} from "eventregistry";
const er = new EventRegistry();
const analytics = new Analytics(er);
analytics.sentiment("Microsoft released a new version of Windows OS.").then((sent) => {
    console.info(sent);
});

Language detection

Here is sample code to detect the language of the document:

import {EventRegistry, Analytics} from "eventregistry";
const er = new EventRegistry();
const analytics = new Analytics(er);
analytics.detectLanguage("Microsoft released a new version of Windows OS.").then((lang) => {
    console.info(lang);
});
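
Each of the four methods above returns a promise, so the calls can also be started together and awaited as a group. The following is a minimal sketch rather than part of the library: `TextAnalytics`, `TextAnalysis` and `analyzeText` are hypothetical names, and the interface assumes only the four method names used in the examples above, each taking the text and returning a promise. Since categorization is documented as English-only, how `categorize` behaves for other languages is not covered here.

```typescript
// Hypothetical structural type: any object exposing the four methods shown above.
// An Analytics instance from the examples is assumed to satisfy it.
interface TextAnalytics {
    annotate(text: string): Promise<unknown>;
    categorize(text: string): Promise<unknown>;
    sentiment(text: string): Promise<unknown>;
    detectLanguage(text: string): Promise<unknown>;
}

// Hypothetical container for the four raw responses.
interface TextAnalysis {
    annotations: unknown;
    categories: unknown;
    sentiment: unknown;
    language: unknown;
}

// Start all four requests at once and wait until every one has resolved.
// If any single request rejects, the returned promise rejects as well.
async function analyzeText(analytics: TextAnalytics, text: string): Promise<TextAnalysis> {
    const [annotations, categories, sentiment, language] = await Promise.all([
        analytics.annotate(text),
        analytics.categorize(text),
        analytics.sentiment(text),
        analytics.detectLanguage(text),
    ]);
    return { annotations, categories, sentiment, language };
}
```

With the `analytics` object from the examples above, this could be called as `analyzeText(analytics, "Microsoft released a new version of Windows OS.").then((result) => console.info(result));`.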

Returned data format

Text categorization

{
    "dmoz": {
        // top categories associated with the text
        "categories": [
            {
                // category ID
                "label": "dmoz/Computers/Companies/Microsoft_Corporation",
                // relevance of the category to the document
                "score": 0.456
            },
            ...
        ],
        // top keywords that summarize the document and their weights
        "keywords": [
            {
                "keyword": "Computers",
                "wgt": 0.160
            },
            ...
        ]
    }
}
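
As an illustration of how this response might be consumed, the sketch below keeps only the categories whose score reaches a chosen threshold. The interfaces mirror the fields shown above. The function name `topCategories`, the threshold value and the second sample category are assumptions made for the example, not part of the library or of a real response.

```typescript
// Types mirroring the documented response fields.
interface CategoryScore {
    label: string;
    score: number;
}

interface KeywordWeight {
    keyword: string;
    wgt: number;
}

interface CategorizationResult {
    dmoz: {
        categories: CategoryScore[];
        keywords: KeywordWeight[];
    };
}

// Hypothetical helper: labels of categories scoring at least minScore, highest first.
function topCategories(result: CategorizationResult, minScore: number): string[] {
    return result.dmoz.categories
        .filter((c) => c.score >= minScore) // filter() copies, so the input is not reordered
        .sort((a, b) => b.score - a.score)
        .map((c) => c.label);
}

// Sample data: the first category comes from the listing above,
// the second is made up purely for illustration.
const sampleCategorization: CategorizationResult = {
    dmoz: {
        categories: [
            { label: "dmoz/Computers/Software", score: 0.12 },
            { label: "dmoz/Computers/Companies/Microsoft_Corporation", score: 0.456 },
        ],
        keywords: [{ keyword: "Computers", wgt: 0.16 }],
    },
};

console.info(topCategories(sampleCategorization, 0.3));
// prints [ 'dmoz/Computers/Companies/Microsoft_Corporation' ]
```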

Language detection

{
    "reliable": true,
    "textBytes": 32,
    // the language candidates for the document
    "languages": [
        {
            "name": "ENGLISH",
            // two-letter ISO 639-1 code of the language
            "code": "en",
            // probability of the document being in this language
            "percent": 96,
            "score": 1321
        },
        ...
    ]
}
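
One possible way to reduce this response to a single language code is sketched below. The interfaces follow the fields shown above. The helper name `bestLanguageCode` is hypothetical, and treating `reliable: false` as "no answer" is an assumption made for this example, not documented behavior.

```typescript
// Types mirroring the documented response fields.
interface LanguageCandidate {
    name: string;
    code: string;
    percent: number;
    score: number;
}

interface LanguageDetectionResult {
    reliable: boolean;
    textBytes: number;
    languages: LanguageCandidate[];
}

// Hypothetical helper: code of the candidate with the highest `percent`,
// or null when the result is flagged unreliable or has no candidates.
function bestLanguageCode(result: LanguageDetectionResult): string | null {
    if (!result.reliable || result.languages.length === 0) {
        return null;
    }
    const best = result.languages.reduce((a, b) => (b.percent > a.percent ? b : a));
    return best.code;
}

// Sample data taken from the listing above.
const sampleDetection: LanguageDetectionResult = {
    reliable: true,
    textBytes: 32,
    languages: [{ name: "ENGLISH", code: "en", percent: 96, score: 1321 }],
};

console.info(bestLanguageCode(sampleDetection)); // prints en
```

Because categorization is supported only for English, a check such as `bestLanguageCode(result) === "en"` could be used to decide whether calling `categorize` is worthwhile.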

Semantic annotation

{
    // the list of annotations
    "annotations": [
        {
            // the URL that uniquely identifies the concept represented by the annotation
            "url": "http://en.wikipedia.org/wiki/Microsoft",
            // the label that can be used to represent the annotation (in the language of the document)
            "title": "Microsoft",
            // the input language
            "lang": "en",

            // secondary URL that uniquely identifies the concept as a concept on the English Wikipedia
            "secUrl": "http://en.wikipedia.org/wiki/Microsoft",
            // label that can represent the concept in English
            "secTitle": "Microsoft",
            "secLang": "en",

            // dbpedia URI of the concept
            "dbPediaIri": "http://dbpedia.org/resource/Microsoft",
            // dbpedia types for the concept
            "dbPediaTypes": [
                "Agent",
                "Organisation",
                "Company"
            ],
            // general categorization of the concept (person, org or loc)
            "type": "org",
            // importance of the concept for the whole document
            "wgt": 0.6666,
            // mentions of the concept in the document
            "support": [
                {
                    // character positions in text
                    "chFrom": 0,
                    "chTo": 8,
                    // based on the word(s) mentioned in the text, how likely it is that this is the correct annotation
                    "pMentionGivenSurface": 0.253001126280801,
                    "pageRank": 0.03690052603740375,
                    // the word/phrase that is used to mention the concept in the text
                    "text": "Microsoft",
                    // word indices
                    "wFrom": 0,
                    "wTo": 0,
                    "wikiLang": "en"
                }
            ],
            "pageRank": 0.2520778231483313,
            // wikidata id for the concept
            "wikiDataItemId": "Q2283",
            // wikidata class ids for the concept
            "wikiDataClassIds": [
                "Q891723",
                "Q1058914",
                "Q4830453",
                "Q43229",
                "Q874405",
                "Q24229398",
                "Q16334295",
                "Q58778",
                "Q35120",
                "Q16334298",
                "Q286583",
                "Q17519152",
                "Q517966",
                "Q223557",
                "Q16889133",
                "Q18844919",
                "Q488383",
                "Q5127848"
            ],
            // wikidata class ids and names
            "wikiDataClasses": [
                {
                    "enLabel": "public company",
                    "itemId": "Q891723"
                },
                {
                    "enLabel": "software house",
                    "itemId": "Q1058914"
                },
                ...
            ]
        },
        ...
    ],
    // list of nouns identified in the document
    "nouns": [
        {
            // starting and ending indices of the noun
            "iFrom": 25,
            "iTo": 31,
            // normalized form of the text
            "normForm": "version",
            // list of WordNet synset IDs for the word
            "synsetIds": [
                "101267901",
                "105840650",
                "105928513",
                "106408779",
                "106536389",
                "107173585"
            ]
        },
        ...
    ],
    // list of adjectives found in the document
    "adjectives": [
        {
            // position in the document
            "iFrom": 21,
            "iTo": 23,
            // normalized form of the adjective
            "normForm": "new",
            // WordNet synset IDs
            "synsetIds": [
                "300024996",
                "300128733",
                "300818008",
                "300937186",
                "301640850",
                "301687167",
                "301687965",
                "302070491",
                "302584699"
            ]
        },
        ...
    ],
    // list of verbs identified in the document
    "verbs": [
        {
            // text positions
            "iFrom": 10,
            "iTo": 17,
            // normalized form of the verb
            "normForm": "release",
            // WordNet synset IDs
            "synsetIds": [
                "200069295",
                "200104868",
                "200269682",
                "200967625",
                "201436518",
                "201474550",
                "201757994",
                "202316304",
                "202421374",
                "202494047"
            ]
        },
        ...
    ],
    // list of adverbs
    "adverbs": [

    ],
    // there are other returned properties that don't have significant importance for the user
}
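
To show how the annotation fields relate to the input text, the sketch below lists every mention together with the concept it points to. The interfaces cover only a subset of the fields shown above, and the names `MentionSummary` and `listMentions` are hypothetical. The code assumes that `chTo` is an inclusive character index, which is inferred from the sample (`chFrom` 0 and `chTo` 8 for the nine-character word "Microsoft") and not from any stated guarantee.

```typescript
// Subset of the documented annotation fields used by this sketch.
interface Mention {
    chFrom: number;
    chTo: number;
    text: string;
}

interface Annotation {
    url: string;
    title: string;
    type?: string; // optional here, as a precaution in case some annotations lack it
    wgt: number;
    support: Mention[];
}

interface AnnotationResult {
    annotations: Annotation[];
}

// Hypothetical summary record for one mention of a concept.
interface MentionSummary {
    title: string;
    type: string | undefined;
    surface: string;
}

// Hypothetical helper: one record per mention, with the surface text cut from the input.
function listMentions(text: string, result: AnnotationResult): MentionSummary[] {
    const out: MentionSummary[] = [];
    for (const ann of result.annotations) {
        for (const m of ann.support) {
            // Assumption: chTo is inclusive, so the slice end is chTo + 1.
            out.push({ title: ann.title, type: ann.type, surface: text.slice(m.chFrom, m.chTo + 1) });
        }
    }
    return out;
}

// Sample data reduced from the listing above.
const sampleText = "Microsoft released a new version of Windows OS.";
const sampleAnnotations: AnnotationResult = {
    annotations: [
        {
            url: "http://en.wikipedia.org/wiki/Microsoft",
            title: "Microsoft",
            type: "org",
            wgt: 0.6666,
            support: [{ chFrom: 0, chTo: 8, text: "Microsoft" }],
        },
    ],
};

console.info(listMentions(sampleText, sampleAnnotations));
```

The `iFrom` and `iTo` values in the `nouns`, `adjectives` and `verbs` lists of the sample appear to follow the same inclusive convention (for example 25 to 31 for "version"), so the same slicing approach would likely apply to them.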