Skip to content

Accumulo schema

loewenzahm edited this page Jun 1, 2016 · 21 revisions

Currently all data is written into two tables: RawTwitterData and TermIndex.

Table: RawTwitterData

Row Column Family Column Qualifier Value
12 byte: 8 byte timestamp (seconds since 1970) and 4 byte murmur2_32 hash of tweet 1 byte "t" - raw json

Example:

Row from RawTwitterData:

rowkeyX t:[] {text:"A tweet about #ApacheAccumulo and #ApacheFlink and", user:{name:"user1", screen_name:"u1scr",...},...}

Table: TermIndex

Row Column Family Column Qualifier Value
term field row frequency of term as string if > 1

Example:

Rows in TermIndex:

#apacheaccumulo text:rowkeyX
#apacheflink text:rowkeyX
about text:rowkeyX
and text:rowkeyX 2
apacheaccumulo text:rowkeyX
apacheflink text:rowkeyX
tweet text:rowkeyX
user1 user:rowkeyX
u1scr user:rowkeyX

Table: TweetFrequency

Currently we are working on writing into a third table called TweetFrequency. It will provide information about the tweet frequency per language and has the following schema:

Row Column Family Column Qualifier Value
time language empty tweet-count

Explanation of the entries:

time is a String of format YYYYMMDDhhmm

language contains the language-tag of the tweets

tweet-count is an int-value which refers to the number of tweets of this language per window

Example:

              201605051430    en    -     3                    

now there are 5 new incoming english tweets during the next minute so it'll be:

              201605051430    en    -     3       
              201605051431    en    -     5       

Table GeoTemporalIndex

This table features a evenly split geospatial index per day. The mapsearch should use this index. Issues: #62 #63 #64 #65

Row Column Family Column Qualifier Value
bytearray: spreading byte + day + geohash rowkey to RawTwitterData (timestamp+hash) lat+lon
1b + 2b + 8b = 11b 8b + 4b = 12b 2x 4b float = 8b
  • spreading byte: murmur_32 of original json string modulo 255
  • day: days since 1.1.1970 -> as short
  • geohash: 8 byte long string
  • rowkey to RawTwitterData is part of the key to prevent key collision and includes the complete tweet timestamp