Learning Personality

Learning Personality is a project carried out during my bachelor internship, in collaboration with the MAD laboratory (DiSCo) of the University of Milano-Bicocca and a group of researchers from the university's psychology department. The work consists of identifying a procedure that automatically extracts the personality of the "target" object to which a given natural-language text refers, expressed here as the Big Five personality traits (OCEAN).

Different representation spaces are explored, from a bag-of-words representation to a word embedding built with the skip-gram version of Tomas Mikolov's word2vec algorithm. Three different types of artificial neural networks have been used.

Dataset

The dataset for this task can be downloaded from https://www.yelp.com/dataset/challenge

Goal

The nature of this thesis project is highly experimental: it aims to present detailed analyses of the topic, since at present there are no major studies that address the problem of learning personality traits from natural-language text.

Training

A series of Python scripts automates preprocessing, feature extraction and model training, and makes them repeatable.

The first model implemented is a feed-forward fully-connected NN.

The second model uses a class of distributional algorithms in which a neural network learns the contexts of words in an unsupervised way. The word embedding generated this way is used as input for a convolutional NN.

The third model transforms the regression problem into a binary multi-label classification problem, in which for each personality dimension the output will be 0 or 1.

Process

  • In the input_pipeline, preprocess_dataset, TFRecord_dataset, load_ocean files:

    • The json file is parsed by extracting only the reviews (converted to lowercase).
    • The training and test datasets are generated with an 80-20 split: in total there are 4974000 sentences in the training dataset and 1243000 in the test one.
    • Sentences are generated by splitting on punctuation.
    • Stopwords are removed from sentences.
    • Sentences that do not contain the adjectives are deleted.
    • Three zip archives, containing the entire dataset, the training set and the test set, are saved to disk.
  • In the dictionary, voc, remove_adj files:

    • A .txt file is generated containing one word per line, for all sentences in order.
    • We then sort the file, keep a counter for each word so that no word is repeated, and sort again.
    • The adjectives of the ocean dataset that appear in the dictionary are removed from it.
    • We generate a new compact file containing only the n most frequent words, so that the 'UNK' token can subsequently be associated with all the other words.
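
The preprocessing and dictionary steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the `STOPWORDS` and `OCEAN_ADJECTIVES` sets, the function names and the choice of splitting only on `.`, `!`, `?` and `;` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stand-ins: the real stopword list and OCEAN adjective list
# come from the project's own files and are not shown in this README.
STOPWORDS = {"the", "a", "was", "and", "is", "but", "it"}
OCEAN_ADJECTIVES = {"friendly", "rude", "creative"}


def split_sentences(review):
    """Lowercase a review and split it into sentences on punctuation."""
    parts = re.split(r"[.!?;]+", review.lower())
    return [p.strip() for p in parts if p.strip()]


def clean_sentence(sentence):
    """Tokenize a sentence and drop stopwords."""
    tokens = re.findall(r"[a-z']+", sentence)
    return [t for t in tokens if t not in STOPWORDS]


def preprocess(reviews):
    """Keep only the cleaned sentences that contain an OCEAN adjective."""
    kept = []
    for review in reviews:
        for sentence in split_sentences(review):
            tokens = clean_sentence(sentence)
            if any(t in OCEAN_ADJECTIVES for t in tokens):
                kept.append(tokens)
    return kept


def build_vocabulary(sentences, n):
    """Return the n most frequent words, excluding the OCEAN adjectives."""
    counts = Counter(t for s in sentences for t in s
                     if t not in OCEAN_ADJECTIVES)
    return [word for word, _ in counts.most_common(n)]


reviews = ["The staff was friendly. The food was cold!",
           "Rude waiter, but the food is creative and the food is cheap."]
sentences = preprocess(reviews)
vocabulary = build_vocabulary(sentences, n=2)
print(sentences)
print(vocabulary)
```

In this toy run the sentence "the food was cold" is dropped because it contains no adjective from the list, and "food" ends up as the most frequent vocabulary entry.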

Model 1

  • In the extract_features, model_input, training files:
    • A lookup table is created containing the 60000 most frequent words. Each word is indexed with a unique integer (its line number); words outside the 60000 most frequent are marked with "-1".
    • A reverse lookup table is created that maps each unique identifier back to its word. Unknown words, identified by '-1', are replaced with the 'UNK' token.
    • The bag-of-words vector is generated and the ocean vector is associated with it.
    • We build a basic model with n fully connected layers. The ReLU non-linear activation function is applied to each of them. Batch normalization is performed after each layer.
    • The model can be trained for n epochs. The optimizer chosen is Adagrad with a learning rate of 0.001. The loss function is the mean squared error (MSE); the root mean squared error (RMSE) is also used as a metric.
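
As a rough illustration of the feature side of Model 1, the sketch below builds the two lookup tables, a bag-of-words vector and the RMSE metric in plain Python. The function names are hypothetical, and reserving the last slot of the vector for 'UNK' is an assumption of this example; the README does not say how unknown words enter the bag-of-words vector.

```python
import math

UNKNOWN_ID = -1  # identifier for out-of-vocabulary words, as in the text


def build_lookup(vocabulary):
    """Map each word to its line number in the frequency-sorted vocabulary."""
    return {word: index for index, word in enumerate(vocabulary)}


def build_reverse_lookup(lookup):
    """Map each identifier back to its word; -1 resolves to 'UNK'."""
    reverse = {index: word for word, index in lookup.items()}
    reverse[UNKNOWN_ID] = "UNK"
    return reverse


def bag_of_words(tokens, lookup):
    """Count words per vocabulary slot; the last slot collects 'UNK'
    (an assumption of this sketch)."""
    vector = [0] * (len(lookup) + 1)
    for token in tokens:
        # Index -1 addresses the last element, i.e. the 'UNK' slot.
        vector[lookup.get(token, UNKNOWN_ID)] += 1
    return vector


def rmse(predicted, target):
    """Root mean squared error between two equal-length vectors."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))


lookup = build_lookup(["food", "staff", "waiter"])
reverse = build_reverse_lookup(lookup)

tokens = ["food", "cheap", "food", "staff"]
ids = [lookup.get(t, UNKNOWN_ID) for t in tokens]
decoded = [reverse[i] for i in ids]
features = bag_of_words(tokens, lookup)

# Hypothetical five-value OCEAN target paired with the feature vector.
ocean_target = [0.3, 0.4, 0.0, 0.0, 0.0]
error = rmse([0.0] * 5, ocean_target)
```

Here `decoded` shows the out-of-vocabulary word "cheap" coming back as 'UNK', and `error` is the RMSE of an all-zero prediction against the made-up target.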

Model 2

  • In the mikolov_features, mikolov_embedding_model, mikolov_embedding_training files:

    • The same procedure is performed to build the dictionary of the 60000 most frequent words.
    • The features for the construction of the embedding are generated by forming a dataset that pairs each word with its context. The word on the right and the word on the left of the target are considered as context.
    • You can set the size of the embedding and the number of negative samples.
    • The network is optimized with Stochastic Gradient Descent (SGD).
  • In the mikolov_model, mikolov_training files:

    • We build a model with n convolutional layers and a final fully connected one. The ReLU non-linear activation function is applied to each of them. Batch normalization is performed after each layer. Furthermore, after the first layer there is a pooling layer.
    • The model can be trained for n epochs. The optimizer chosen is Adagrad with a learning rate of 0.005. The loss function is the mean squared error (MSE); the root mean squared error (RMSE) is also used as a metric.
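
The pairing of each word with its left and right neighbour, described above for the embedding features, could look like the following sketch. The function name `skipgram_pairs` is hypothetical, and the example works on word strings for readability, whereas the project presumably works on the integer identifiers from the dictionary.

```python
def skipgram_pairs(tokens, window=1):
    """Pair every token with the tokens up to `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is never its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = skipgram_pairs(["rude", "waiter", "food"])
print(pairs)
```

Words at the edges of a sentence simply get fewer context words, so a three-word sentence yields four (target, context) pairs.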

Model 3

  • In the mikolov_features file:
    • The same procedure as for model 2 is carried out, but the embedding is built from a dataset that pairs each adjective of interest with its context. The two words on the right and the two words on the left of the target are considered as context.
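
A minimal sketch of this adjective-centred variant is shown below. The function name, the sample sentence and the one-element adjective set are made up for illustration; only the idea of restricting targets to adjectives and widening the window to two words per side comes from the description above.

```python
def adjective_pairs(tokens, adjectives, window=2):
    """Build (adjective, context word) pairs, skipping non-adjective targets."""
    pairs = []
    for i, target in enumerate(tokens):
        if target not in adjectives:
            continue
        # Up to `window` words on the left and on the right of the adjective.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.extend((target, word) for word in context)
    return pairs


tokens = ["staff", "really", "friendly", "helpful", "today", "again"]
adj_pairs = adjective_pairs(tokens, adjectives={"friendly"})
print(adj_pairs)
```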

Model 4

  • In the mikolov_multiclass_binary_model, mikolov_multiclass_binary_training files:
    • The procedure for extracting the embedding is the same as for the two previous models.
    • We build a basic model with n layers. The ReLU non-linear activation function is applied to each of them. Batch normalization is performed after each layer.
    • The model is similar to the previous one, with the difference that the objective function is a softmax cross entropy. Furthermore, accuracy is used as the metric for each personality trait, and the confusion matrices are plotted with Tensorboard.
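
To make the per-trait evaluation concrete, the sketch below turns continuous OCEAN scores into 0/1 labels and computes accuracy and a 2x2 confusion matrix for each trait. The threshold of 0.0, the trait abbreviations and all the numbers are assumptions chosen for the example; the README does not state how the scores are binarized.

```python
TRAITS = ["O", "C", "E", "A", "N"]


def binarize(scores, threshold=0.0):
    """Turn continuous trait scores into 0/1 labels (threshold is assumed)."""
    return [1 if s > threshold else 0 for s in scores]


def per_trait_accuracy(predictions, labels):
    """Fraction of correct 0/1 predictions, computed separately per trait."""
    accuracy = {}
    for k, trait in enumerate(TRAITS):
        correct = sum(p[k] == y[k] for p, y in zip(predictions, labels))
        accuracy[trait] = correct / len(labels)
    return accuracy


def confusion_matrix(predictions, labels, k):
    """2x2 counts for trait k: rows are true labels, columns are predictions."""
    matrix = [[0, 0], [0, 0]]
    for p, y in zip(predictions, labels):
        matrix[y[k]][p[k]] += 1
    return matrix


# Made-up scores and predictions for two sentences.
scores = [[0.4, -0.2, 0.1, 0.0, -0.7],
          [-0.1, 0.3, 0.5, -0.2, 0.2]]
labels = [binarize(s) for s in scores]
predictions = [[1, 0, 0, 0, 0],
               [0, 1, 1, 1, 1]]

accuracy = per_trait_accuracy(predictions, labels)
extraversion_matrix = confusion_matrix(predictions, labels, k=2)
```

Keeping one accuracy value and one confusion matrix per trait matches the multi-label framing, in which each personality dimension is judged independently of the others.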