Thursday, 16 May 2013

Creating a Fortune Telling Word Engine

Making an automaton speak random slices of people's tweets is one thing; making them approximate a reasonably grammatical or intelligible phrase is quite another. It struck me that the key to this is looking at the extent of structure and classification that may (or may not!) exist within tweets. This is actually fairly simple. Although there are not many of them, there are grammar rules on Twitter that are generally used in a consistent way. There is a community etiquette that monitors and maintains these standards. The most important are listed below, with some indication of how each might be used.

  • @username - any word with an @ in front of it is very specifically a twitter username, e.g. @rosemarybeetle. This is easy peasy. If a @username is featured, the person is either being discussed or directly alerted/invited to contribute. These could be separated out to give a list of people involved in the discussion somehow - very handy!
  • #hashtag - this is twitter grammar for a topic of conversation. Once adopted, a hashtag is intended to be used for only one specific subject. Again, hashtags are very useful and could be separated out to give a list of subjects being discussed or referenced. There is a special case to be handled, though, which is the main discussion hashtag. This is the glue of a twitter discussion, but it will crowd out any other hashtags in terms of frequency of occurrence.
    There are some variations on this. For example, some hashtags for a recurring event may have a root and a date modifier, e.g. #MW2012, #MW2013, etc. Hashtags can often be acronyms, and there are also occasionally some acronymous (is that even a word?) hashtags that can lead to confusion. For instance, the hashtag #rdg has been used for some time by a local newspaper to denote the town of Reading, UK, but the popularity of the Japanese anime Red Data Girl led to the #rdg hashtag being widely adopted to refer to that.
  • URLs - these are also very handy, as a URL is the twitter equivalent of a reference. They are usually not the actual url, but a shortened referral url. Again, these were considered to be worth separating out, as they are resources associated with the discussion and lead to more in-depth ideas that are too long for representation on twitter.
  • RT or RT:  - (ReTweet) this is the standard etiquette for acknowledging that a tweet being sent is not original, but is someone else's tweet being sent on. It is equivalent to a traditional credit. If followed by a @username, there is an implication that this is the person who sent the original tweet, but this is not a strict rule. While RT might be useful, it was decided not to bother distinguishing usernames that are being credited from those that are being mentioned or included.
  • VIA  - similar to RT, this is usually an acknowledgement that this is a secondary retweet, naming the person who sent it on. It is almost always the case that a following @username is the person who retweeted it initially.
  • "words in quotes" - Quotation marks are used as a shorthand for a direct quotation from another tweet. 
Effectively, everything else inside a tweet is just the body of the message, so for the purposes of the Psychic Hive Mind Reader, the algorithm starts like this:
  1. go get some tweets
  2. strip out and make a list of @usernames
  3. strip out and make a list of #hashtags
  4. strip out and make a list of urls
  5. make a list of everything else
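
The steps above can be sketched in a few lines of Python. This is only a rough illustration, not the actual Psychic Hive Mind Reader code: the regular expressions are deliberately simple approximations rather than Twitter's official rules, the function names (split_tweet, split_tweets, hashtag_counts) are made up for this example, and step 1 is stubbed out with invented sample tweets, including an invented short link. URLs are pulled out first, which differs from the order in the list, so that an @ or # inside a link is not mistaken for a username or hashtag.

```python
import re
import string
from collections import Counter

# Simplified, assumed patterns; real tweets have more edge cases than these cover.
URL_RE = re.compile(r"https?://\S+")
USERNAME_RE = re.compile(r"(?<!\w)@\w+")  # the lookbehind skips e-mail addresses
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")


def split_tweet(text):
    """Return (usernames, hashtags, urls, words) for a single tweet."""
    # Take URLs out first so '#' or '@' inside a link is left alone.
    urls = URL_RE.findall(text)
    rest = URL_RE.sub(" ", text)

    usernames = USERNAME_RE.findall(rest)
    rest = USERNAME_RE.sub(" ", rest)

    hashtags = HASHTAG_RE.findall(rest)
    rest = HASHTAG_RE.sub(" ", rest)

    # Everything else is the body; trim stray punctuation, drop empty tokens.
    words = [w.strip(string.punctuation) for w in rest.split()]
    words = [w for w in words if w]
    return usernames, hashtags, urls, words


def split_tweets(tweets):
    """Run split_tweet over many tweets and pool the four lists."""
    all_users, all_tags, all_urls, all_words = [], [], [], []
    for text in tweets:
        users, tags, urls, words = split_tweet(text)
        all_users.extend(users)
        all_tags.extend(tags)
        all_urls.extend(urls)
        all_words.extend(words)
    return all_users, all_tags, all_urls, all_words


def hashtag_counts(hashtags, main_tag):
    """Count hashtags, ignoring the main discussion hashtag (case-insensitive)."""
    main = main_tag.lower()
    return Counter(t.lower() for t in hashtags if t.lower() != main)


# Step 1 ("go get some tweets") is stubbed with made-up sample text.
sample_tweets = [
    "RT @rosemarybeetle: reading about #MW2013 at http://t.co/abc123 today",
    "Is #rdg the town or the anime? #MW2013",
]
usernames, hashtags, urls, words = split_tweets(sample_tweets)
```

Note that RT ends up in the list of everything else, in line with the decision not to treat credited usernames differently. The hashtag_counts helper is one possible way of stopping the main discussion hashtag from crowding out the others; it simply leaves that tag out of the tally.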
