A set of Python modules for working with the Cornell Movie-Dialogs Corpus through Storm.
This package includes classes that extend the Storm ORM to model the Cornell Movie-Dialogs Corpus data.
pip install storm # only needed if you did not pip install cornel-movie-dialogs-corpus-storm
- Download the corpus and unzip it.
- Generate the database and insert the corpus data with
generate-mdcorpus-database.py
For example:
generate-mdcorpus-database.py --corpus-dir "cornell movie-dialogs corpus" corpus.db
from mdcorpus.orm import *
from mdcorpus.parser import *
...
- MovieTitlesMetadata
- Genre
- MovieGenreLine
- MovieCharactersMetadata
- MovieConversation
- MovieLine
- RawScriptUrl
These are my notes on the corpus problems I ran into.
- I ignored the letter that follows the year.
- for example, line 34,
1989/I
- I ignored duplicate entries in the genre data.
- line 58,
['horror', 'mystery', 'mystery', 'sci-fi', 'sci-fi']
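The two workarounds above could be sketched as follows. This is a minimal illustration and not necessarily how mdcorpus.parser handles these cases; the helper names normalize_year and unique_genres are hypothetical, and the sketch assumes the genre field is written as a Python-style list literal like the one shown for line 58.

```python
import ast
import re


def normalize_year(raw):
    """Return the leading four-digit year, ignoring a suffix such as '/I'."""
    match = re.match(r"\d{4}", raw.strip())
    if match is None:
        raise ValueError("no year found in %r" % raw)
    return int(match.group(0))


def unique_genres(raw):
    """Parse a genre list literal and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for genre in ast.literal_eval(raw):
        if genre not in seen:
            seen.add(genre)
            result.append(genre)
    return result
```

Keeping the first-seen order (rather than using a plain set) makes the result deterministic, which helps when the genres are later inserted as rows.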
I use Python 2.7, and I don't know how to use the codecs module (Unicode HOWTO, Python 2.7ja1 documentation).
I converted the text encoding to UTF-8 with Mi.
cornell movie-dialogs corpus$ file --mime {(ls)}
README.txt: text/plain; charset=iso-8859-1
chameleons.pdf: application/pdf; charset=binary
movie_characters_metadata.txt: text/plain; charset=iso-8859-1
movie_conversations.txt: text/plain; charset=us-ascii
movie_lines.txt: text/plain; charset=us-ascii
movie_titles_metadata.txt: text/plain; charset=iso-8859-1
raw_script_urls.txt: text/plain; charset=iso-8859-1

cornell movie-dialogs corpus$ file --mime {(ls)}
README.txt: text/plain; charset=utf-8
chameleons.pdf: application/pdf; charset=binary
movie_characters_metadata.txt: text/plain; charset=utf-8
movie_conversations.txt: text/plain; charset=us-ascii
movie_lines.txt: text/plain; charset=us-ascii
movie_titles_metadata.txt: text/plain; charset=utf-8
raw_script_urls.txt: text/plain; charset=utf-8
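For readers who would rather stay in Python, a similar conversion might be done with io.open, which is available in Python 2.7 as well as Python 3. This is only a sketch and not what was done here (the files were converted with Mi); the function name convert_to_utf8 and the output file name in the usage comment are hypothetical, and the source encoding is assumed to be iso-8859-1, as reported by file --mime.

```python
import io


def convert_to_utf8(src_path, dst_path, src_encoding="iso-8859-1"):
    """Re-encode a text file from src_encoding to UTF-8."""
    # newline="" leaves line endings untouched in both directions.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Hypothetical usage:
# convert_to_utf8("movie_titles_metadata.txt", "movie_titles_metadata.utf8.txt")
```

Writing to a separate output file keeps the original around, so the result can be compared before anything is overwritten.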
- line 115,
léon
- lines 1727 - 1736,
léon
sqlite> select * from movie_titles_metadata where title = 'léon';
sqlite> select * from movie_titles_metadata where title = 'l駮n';
114|l駮n|1994|8.6|204901
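One plausible explanation for the stored title, offered here as an assumption rather than something these notes confirm, is that the iso-8859-1 bytes were read as Shift_JIS during conversion. Under that reading the byte for "é" and the following "o" are consumed together as one two-byte character. The sketch below reproduces the effect; the function name misdecode is hypothetical.

```python
# -*- coding: utf-8 -*-


def misdecode(text, real_encoding="iso-8859-1", assumed_encoding="shift_jis"):
    """Encode text in its real encoding, then decode it with the wrong one."""
    raw = text.encode(real_encoding)  # "léon" becomes the bytes l, 0xE9, o, n
    return raw.decode(assumed_encoding)


garbled = misdecode(u"léon")
```

If the assumption holds, garbled should be the same three-character string that the second query matches, which would mean the source encoding has to be set to iso-8859-1 explicitly when converting.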