Accumulo schema
Currently all data is written into two tables: RawTwitterData and TermIndex.
Row | Column Family | Column Qualifier | Value |
---|---|---|---|
12 bytes: 8-byte timestamp (seconds since 1970) followed by a 4-byte murmur2_32 hash of the tweet | "t" (1 byte) | - | raw JSON |
Example:
Row from RawTwitterData:
rowkeyX t:[] {text:"A tweet about #ApacheAccumulo and #ApacheFlink and", user:{name:"user1", screen_name:"u1scr",...},...}
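The 12-byte row key layout described above can be sketched as follows. This is a minimal illustration, not the project's code: the function names `raw_row_key` and `split_raw_row_key` are hypothetical, the murmur2_32 hash is passed in as an already computed value, and big-endian byte order with an unsigned hash is an assumption the schema does not state.

```python
import struct


def raw_row_key(timestamp_seconds, tweet_hash):
    """Pack an 8-byte timestamp and a 4-byte hash into a 12-byte row key.

    Assumption: big-endian encoding, hash treated as unsigned 32-bit.
    """
    return struct.pack(">qI", timestamp_seconds, tweet_hash & 0xFFFFFFFF)


def split_raw_row_key(row_key):
    """Return (timestamp_seconds, tweet_hash) from a 12-byte row key."""
    return struct.unpack(">qI", row_key)
```

With big-endian encoding, the byte-wise ordering of keys matches chronological order for non-negative timestamps, which fits the way Accumulo sorts rows lexicographically.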
Row | Column Family | Column Qualifier | Value |
---|---|---|---|
term | field | row key in RawTwitterData | frequency of the term as a string, only if greater than 1 |
Example:
Rows in TermIndex:
#apacheaccumulo text:rowkeyX
#apacheflink text:rowkeyX
about text:rowkeyX
and text:rowkeyX 2
apacheaccumulo text:rowkeyX
apacheflink text:rowkeyX
tweet text:rowkeyX
user1 user:rowkeyX
u1scr user:rowkeyX
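The TermIndex rows in the example could be derived roughly as shown here. This is a sketch under assumptions, not the project's tokenizer: the function name `index_entries`, the regular expression, and the minimum term length of 2 (inferred only from "A" being absent in the example) are all hypothetical.

```python
import re
from collections import Counter


def index_entries(row_key, field, text, min_length=2):
    """Build (term, field, row_key, value) tuples for one field of a tweet.

    Hashtags are indexed both with and without the leading '#'.
    The value is the term frequency as a string, or empty if it is 1.
    """
    counts = Counter()
    for token in re.findall(r"#?\w+", text.lower()):
        if token.startswith("#"):
            counts[token] += 1      # index the hashtag form
            token = token[1:]       # and fall through to the plain word
        if len(token) >= min_length:
            counts[token] += 1
    return [
        (term, field, row_key, str(n) if n > 1 else "")
        for term, n in sorted(counts.items())
    ]
```

Calling `index_entries("rowkeyX", "text", "A tweet about #ApacheAccumulo and #ApacheFlink and")` yields the seven `text` entries listed in the example, with the value `"2"` only for `and`.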
We are currently working on writing into a third table, TweetFrequency. It will record the tweet frequency per language and has the following schema:
Row | Column Family | Column Qualifier | Value |
---|---|---|---|
time | language | empty | tweet-count |
Explanation of the entries:
- time: a string in the format YYYYMMDDhhmm
- language: the language tag of the tweets
- tweet-count: an integer, the number of tweets in this language within the time window
Example:
201605051430 en - 3
If 5 new English tweets arrive during the next minute, the table becomes:
201605051430 en - 3
201605051431 en - 5
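The per-minute counting in this example can be sketched as follows. The function names `frequency_row` and `count_tweets` are hypothetical, and interpreting timestamps as UTC is an assumption, since the schema does not name a time zone.

```python
from collections import Counter
from datetime import datetime, timezone


def frequency_row(timestamp_seconds):
    """Format a Unix timestamp as a YYYYMMDDhhmm row key (UTC assumed)."""
    moment = datetime.fromtimestamp(timestamp_seconds, tz=timezone.utc)
    return moment.strftime("%Y%m%d%H%M")


def count_tweets(tweets):
    """Count tweets per (time, language) from (timestamp, language) pairs."""
    counts = Counter()
    for timestamp_seconds, language in tweets:
        counts[(frequency_row(timestamp_seconds), language)] += 1
    return counts
```

Each key of the returned counter corresponds to one row and column family in TweetFrequency, and the count is the value to store.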
The following table holds an evenly split geospatial index per day. The map search should use this index. Issues: #62 #63 #64 #65
Row | Column Family | Column Qualifier | Value |
---|---|---|---|
byte array: spreading byte + day + geohash | row key to RawTwitterData (timestamp + hash) | lat + lon | |
1 byte + 2 bytes + 8 bytes = 11 bytes | 8 bytes + 4 bytes = 12 bytes | 2 x 4-byte float = 8 bytes | |
- spreading byte: murmur_32 hash of the original JSON string, modulo 255
- day: days since 1.1.1970, stored as a short (2 bytes)
- geohash: 8-character string (8 bytes)
- row key to RawTwitterData: part of the key to prevent key collisions; it includes the complete tweet timestamp
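The key layout above can be sketched as follows. This is only an illustration under assumptions: the function names `geohash` and `geo_index_entry` are hypothetical, the hash is passed in as an already computed unsigned 32-bit value, big-endian byte order is assumed, and the geohash is computed with the standard base-32 algorithm rather than whatever library the project uses.

```python
import struct

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat, lon, length=8):
    """Encode a coordinate as a standard geohash string."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, value, use_lon, chars = 0, 0, True, []
    while len(chars) < length:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid
        else:
            value <<= 1
            rng[1] = mid
        use_lon = not use_lon       # alternate longitude and latitude bits
        bits += 1
        if bits == 5:               # 5 bits make one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def geo_index_entry(tweet_hash, timestamp_seconds, lat, lon):
    """Return (row, column family, column qualifier) as byte strings."""
    spreading = tweet_hash % 255                 # 1 byte
    day = timestamp_seconds // 86400             # days since 1.1.1970, 2 bytes
    row = struct.pack(">Bh", spreading, day) + geohash(lat, lon).encode("ascii")
    family = struct.pack(">qI", timestamp_seconds, tweet_hash)  # 12 bytes
    qualifier = struct.pack(">ff", lat, lon)     # 2 x 4-byte float
    return row, family, qualifier
```

The leading spreading byte distributes entries for the same day and area across up to 255 key prefixes, so a scan for one day and geohash prefix has to query each prefix.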