1 TL;DR

Just a bit of dataknut fun woven around the day job.

You’ll be wanting Section 6 for the trending hashtags…

2 Terms of re-use

2.1 License

CC-BY unless otherwise noted.

2.2 Citation

3 Purpose

I’ve not been in NZ at this time of year before, so #birdoftheyear (#boty) is a whole new cultural experience.

The idea is to extract and visualise tweets and re-tweets of #birdoftheyear and #boty (see https://twitter.com/hashtag/birdoftheyear and the Forest & Bird voting site).

Why? Err…. Just. Because.

4 How it works

Code borrows extensively from https://github.com/mkearney/rtweet

The analysis used rtweet to ask the free Standard Twitter search API to extract ‘all’ tweets containing the #birdoftheyear OR #boty hashtags in the freely available recent (last 7 days) twitterVerse. When a search is repeated, the same tweet can appear more than once if one of its attributes (e.g. number of likes & re-tweets) has changed since the last search.

The search was repeated irregularly throughout the time that voting was open.

It is possible that not all relevant tweets have been extracted because the free (Standard) search does not run over a complete archive of all tweets.

Future work should instead use the Twitter streaming API to set up a proper siphon of relevant tweets throughout the relevant period.

## [1] "Found 69 files matching #birdoftheyear OR #boty in ~/Data/twitter/boty2018/raw/"

192,786 duplicates of the <created_at><screen_name> tuple were removed from the original 207,026 extracted tweets, leaving 14,240. Note that the duplicates remain in the raw data and may be useful for analysis of the dynamics of re-tweeting etc over time.
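
As a rough illustration of that de-duplication step, here is a minimal base R sketch. The toy data frame, the `favorite_count` column and the choice to keep the first copy of each tuple are all assumptions made for the example; the report does not say which copy was kept, and the real pipeline may well differ.

```r
# Toy stand-in for the raw extract: the same tweet captured by two searches
# (all values are made up for illustration)
rawTweets <- data.frame(
  created_at     = c("2018-10-01 08:00:00", "2018-10-01 08:00:00", "2018-10-02 09:30:00"),
  screen_name    = c("userA", "userA", "userB"),
  favorite_count = c(1, 5, 0),
  stringsAsFactors = FALSE
)

# Flag rows whose <created_at><screen_name> pair has already been seen
isDup <- duplicated(rawTweets[, c("created_at", "screen_name")])

# Assumption: keep the first occurrence of each pair
cleanTweets <- rawTweets[!isDup, ]

# Number of duplicates removed
nRemoved <- sum(isDup)
```

Passing `fromLast = TRUE` to `duplicated()` would instead keep the most recently captured copy, which carries the latest like and re-tweet counts.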

The cleaned data used in the rest of this report has:

5 Analysis

5.1 Tweets and Tweeters over time

Voting closed on Sunday 14th October with the results announced on Monday 15th.

Figure 5.1: Number of tweets and tweeters

Figure 5.1 shows the number of tweets and tweeters in the data extract by day. The quotes, tweets and re-tweets have been separated.

If you are in New Zealand and you are wondering why there are no tweets today (2018-10-17) the answer is that twitter data (and these plots) are working in UTC and (y)our today() may not have started yet in UTC. Don’t worry, all the tweets are here - it’s just our old friend the timezone… :-)
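
To make the timezone point concrete, here is a small base R sketch. The timestamp is invented for illustration and `Pacific/Auckland` is simply the standard tz database name for New Zealand; neither comes from the report's own code.

```r
# A tweet timestamp as Twitter reports it, in UTC (illustrative value)
createdUTC <- as.POSIXct("2018-10-16 20:30:00", tz = "UTC")

# Calendar date in UTC, which is what the plots use
utcDate <- format(createdUTC, "%Y-%m-%d", tz = "UTC")

# The same instant on a New Zealand clock (NZDT is UTC+13 in mid-October)
nzTime <- format(createdUTC, "%Y-%m-%d %H:%M", tz = "Pacific/Auckland")

print(utcDate)
print(nzTime)
```

So a tweet sent at 09:30 on the 17th in New Zealand is plotted against the 16th, because it was still the 16th in UTC.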

5.2 Who’s tweeting?

Next we’ll try by screen name.

Figure 5.2: N tweets per day by screen name

Figure 5.2 is a really bad visualisation of all tweeters tweeting over time. Each row of pixels is a tweeter (the names are probably illegible) and a green dot indicates a few tweets in the given day while a red dot indicates a lot of tweets.

So let’s re-do that for the top 50 tweeters so we can see their tweetStreaks (tm)…

Top tweeters:

Table 5.1: Top 15 tweeters (all days)
screen_name nTweets
birdoftheyear 405
Forest_and_Bird 214
testeeves 189
vote4kaki 163
NatForsdick 145
coolbiRdpics 127
acheroraptor 121
freshwaterfelix 98
mifflangstone 85
Fonziethewhio 85
This_NZ_Life 74
AlisonBallance 74
lin_nah 72
newzealandbirds 71
64by4 68

And their tweetStreaks are shown in Figure 5.3.

Figure 5.3: N tweets per day by screen name (top 50, reverse alphabetical)

Any twitterBots…?

5.3 Which hashtags are mentioned the most?

This is very quick and dirty but… to calculate this we have to do a bit of string processing first.

This is how I have tidied the hashtags (make other suggestions here):

# First we make everything lower case
# (data.table's := updates htLongDT by reference, so no re-assignment is needed)
htLongDT[, htLower := tolower(htOrig)]

# Next we remove the macrons, just in case. h/t:
# https://twitter.com/Thoughtfulnz/status/1046685305569345536
htLongDT[, htClean := stringr::str_replace_all(htLower, "[āēīōū]", myUtils::deMacron)]

# Now remove 'team' from a string so that e.g. teamkaki == kaki
htLongDT[, htClean := gsub("team", "", htClean)]

# Now remove variants on 'vote' ('vote4' first so no stray '4' is left behind)
htLongDT[, htClean := gsub("vote4", "", htClean)]
htLongDT[, htClean := gsub("vote", "", htClean)]
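
For anyone who wants to try the same tidying without data.table or `myUtils`, here is a self-contained base R sketch. `cleanHashtag()` is a hypothetical helper written for this example, and the simple character-for-character swap is only an assumed stand-in for whatever `myUtils::deMacron` actually does.

```r
# Hypothetical helper: normalise a character vector of hashtags
cleanHashtag <- function(ht) {
  ht <- tolower(ht)  # lower case

  # Swap lower-case macronised vowels for plain ones (ā ē ī ō ū -> a e i o u)
  from <- c("\u0101", "\u0113", "\u012b", "\u014d", "\u016b")
  to   <- c("a", "e", "i", "o", "u")
  for (i in seq_along(from)) {
    ht <- gsub(from[i], to[i], ht, fixed = TRUE)
  }

  ht <- gsub("team", "", ht, fixed = TRUE)  # teamkaki -> kaki
  ht <- gsub("vote4?", "", ht)              # vote4kaki and votekaki -> kaki
  ht
}

cleaned <- cleanHashtag(c("TeamKak\u012b", "vote4kaki", "Kerer\u016b"))
print(cleaned)
```

Note that, like the original, this strips 'team' and 'vote' wherever they occur in a hashtag, not just at the start.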

Table 5.2 shows the total count of each #hashtag by (re)tweet type. With thanks to David Hood for code to help make sure that kakī == kaki (etc).

Table 5.2: Top 20 hashtags
hashTag type count
birdoftheyear Re-tweet 6193
birdoftheyear Tweet 2908
birdoftheyear Quote 807
kaki Re-tweet 747
takayay Re-tweet 522
boty Re-tweet 252
kereru Re-tweet 249
boty Tweet 240
kaki Tweet 221
dammitgannet Re-tweet 207
ruru Re-tweet 160
nzbotyart Re-tweet 155
rockhopper Tweet 149
takahe Re-tweet 147
rockhopper Re-tweet 144
dammitgannet Tweet 126
kerecrew Re-tweet 120
kaki Quote 117
whio Tweet 93
kereru Tweet 83

Figure 5.4 plots the daily occurrence of these hashtags after removing variants of #birdoftheyear OR #boty and selecting only those which have more than 10 mentions on any day. For clarity, tweets and re-tweets are aggregated. See Section 7 for the problems with this #hashTag counting approach.

Figure 5.4: Most mentioned #hashtags per day (only > 10 per day shown)

6 Most popular hashtags over time

So, who’s gonna win? No idea.

There are a lot of problems with this approach (see Section 7), but if the hashtags have any predictive value at all then Figure 6.1 should be an indicator of the direction of travel and Figure 6.2 shows the totals to date. In Figure 6.1, watch for separate lines of apparently dissimilar hashtags where the macron fix has failed.

The official results show the Kererū as the winner and the Kakī third after the Kākāpō.

Figure 6.1 uses plotly to avoid having to render a large legend - just hover over the lines to see who is who…

Figure 6.1: Cumulative hashtag counts over time (only total count >30 shown)
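
The cumulative counts behind a plot like this can be sketched in base R as follows. The daily numbers, column names and object names are all made up for illustration, and the report itself may compute them differently (it uses data.table elsewhere).

```r
# Toy daily counts per hashtag (values are invented, not from the report)
dailyCounts <- data.frame(
  day     = as.Date(c("2018-10-01", "2018-10-02", "2018-10-03",
                      "2018-10-01", "2018-10-02", "2018-10-03")),
  hashtag = rep(c("kaki", "kereru"), each = 3),
  n       = c(20, 15, 10, 5, 10, 5),
  stringsAsFactors = FALSE
)

# Order by day within hashtag so the running sum follows time
dailyCounts <- dailyCounts[order(dailyCounts$hashtag, dailyCounts$day), ]

# Running total of mentions for each hashtag
dailyCounts$cumN <- ave(dailyCounts$n, dailyCounts$hashtag, FUN = cumsum)

# Keep only hashtags whose overall total exceeds 30, as in the figure
totals   <- tapply(dailyCounts$n, dailyCounts$hashtag, sum)
keep     <- names(totals)[totals > 30]
plotData <- dailyCounts[dailyCounts$hashtag %in% keep, ]
```

`plotData` then has one row per hashtag per day, ready to be drawn as one line per hashtag.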

Figure 6.2: Total hashtag counts to date (only total count > 30 shown)

7 Problems

Loads of them. But primarily:

8 About

As ever, #YMMV.

Analysis completed in 73.274 seconds (1.22 minutes) using knitr in RStudio with R version 3.5.1 (2018-07-02) running on x86_64-apple-darwin15.6.0.

A special mention must go to rtweet (Kearney 2018) for the twitter API interaction functions.

Other R packages used:

References

Dowle, M, A Srinivasan, T Short, S Lianoglou with contributions from R Saporta, and E Antonyan. 2015. Data.table: Extension of Data.frame. https://CRAN.R-project.org/package=data.table.

Kearney, Michael W. 2018. Rtweet: Collecting Twitter Data. https://cran.r-project.org/package=rtweet.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Sievert, Carson, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, and Pedro Despouy. 2016. Plotly: Create Interactive Web Graphics via ’Plotly.js’. https://CRAN.R-project.org/package=plotly.

Wickham, Hadley. 2007. “Reshaping Data with the reshape Package.” Journal of Statistical Software 21 (12): 1–20. http://www.jstatsoft.org/v21/i12/.

———. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

———. 2016. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.

Wickham, Hadley, Jim Hester, and Romain Francois. 2016. Readr: Read Tabular Data. https://CRAN.R-project.org/package=readr.

Xie, Yihui. 2016a. Bookdown: Authoring Books and Technical Documents with R Markdown. Boca Raton, Florida: Chapman; Hall/CRC. https://github.com/rstudio/bookdown.

———. 2016b. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://CRAN.R-project.org/package=knitr.

Zhu, Hao. 2018. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.