First, let’s load tiktokr and some tidyverse libraries.

Make sure to use your preferred Python installation

The next two steps you only need to do once:

  1. Install necessary Python libraries
  2. Authentication

In November 2020, TikTok tightened its security. It now frequently shows a captcha, which is easily triggered after a few requests. You can work around this by specifying the cookie parameter. To get a cookie session:

  1. Open a browser and go to TikTok
  2. Scroll down a bit to make sure that no captcha appears
  3. Open the JavaScript console (in Chrome: View > Developer > JavaScript Console)
  4. Run document.cookie in the console. Copy the entire output (your cookie).
  5. Run tk_auth() in R and paste the cookie.

Click on the image below for a screen recording of how to get your TikTok cookie:

The tk_auth function saves the cookie (and user agent) as environment variables in your .Renviron file. You only need to run it once before using tiktokr, and again whenever you want to update your cookie or user agent.

tk_auth(cookie = "<paste here the output from document.cookie>")
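As a rough illustration of what "saving a cookie as an environment variable" means, the base R sketch below sets and reads a variable for the current session only. The variable name TIKTOKR_DEMO_COOKIE and the cookie string are made-up placeholders, not necessarily what tiktokr uses, and nothing here writes to .Renviron. Variables stored in .Renviron are read when R starts, which is why a value saved there is available in later sessions.

```r
# Hypothetical variable name and value, chosen only for this illustration.
Sys.setenv(TIKTOKR_DEMO_COOKIE = "key1=value1; key2=value2")

# Read the value back. Sys.getenv() returns "" if the variable is not set.
demo_cookie <- Sys.getenv("TIKTOKR_DEMO_COOKIE")

# A simple check that a (non-empty) cookie is available in this session.
cookie_is_set <- nzchar(demo_cookie)
print(cookie_is_set)
```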

Getting #statstiktok posts

Once per script, you need to run tk_init to initialize tiktokr.

Let’s now get data on #statstiktok with tk_posts!

stats_tiktok <- tk_posts(scope = "hashtag", query = "statstiktok", n = 2000)

Great! Now we have a dataset with metadata on tiktoks mentioning the hashtag ‘statstiktok’!

#> Rows: 123
#> Columns: 67
#> $ video_id                   <chr> "awesome", "awesome", "awesome", "awesome"…
#> $ video_height               <int> 1024, 1024, 1024, 1024, 1024, 544, 1024, 1…
#> $ video_width                <int> 576, 576, 576, 576, 576, 960, 576, 576, 54…
#> $ video_duration             <int> 9, 7, 19, 12, 15, 10, 11, 13, 44, 8, 6, 9,…
#> $ video_ratio                <chr> "720p", "720p", "720p", "720p", "720p", "7…
#> $ video_cover                <chr> "…
#> $ video_originCover          <chr> "…
#> $ video_dynamicCover         <chr> "…
#> $ video_playAddr             <chr> "…
#> $ video_downloadAddr         <chr> "…
#> $ video_shareCover           <chr> "c(\"\", \"https://p16-sign-sg.tiktokcdn.c…
#> $ video_reflowCover          <chr> "…
#> $ author_id                  <chr> "6815076321426326533", "681507632142632653…
#> $ author_uniqueId            <chr> "chelllarson", "chelllarson", "chelseaparl…
#> $ author_nickname            <chr> "Mitchell", "Mitchell", "Chelsea Parlett",…
#> $ author_avatarThumb         <chr> "…
#> $ author_avatarMedium        <chr> "…
#> $ author_avatarLarger        <chr> "…
#> $ author_signature           <chr> "#statstiktok", "#statstiktok", "Just real…
#> $ author_verified            <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE…
#> $ author_secUid              <chr> "MS4wLjABAAAARU9QzliiCVeZiuAx4r-JdbzhucGjg…
#> $ author_secret              <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE…
#> $ author_ftc                 <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE…
#> $ author_relation            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ author_openFavorite        <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE…
#> $ author_commentSetting      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ author_duetSetting         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ author_stitchSetting       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ author_privateAccount      <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE…
#> $ music_id                   <chr> "6886831115148036869", "689022456349447245…
#> $ music_title                <chr> "Ur appreciated ily", "original sound", "o…
#> $ music_playUrl              <chr> "…
#> $ music_coverThumb           <chr> "…
#> $ music_coverMedium          <chr> "…
#> $ music_coverLarge           <chr> "…
#> $ music_authorName           <chr> "angela vasquez \U0001f49f", "carson", "Ra…
#> $ music_original             <chr> "TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "…
#> $ stats_diggCount            <int> 42500, 2853, 2789, 1328, 1019, 638, 531, 2…
#> $ stats_shareCount           <int> 6136, 480, 355, 10, 44, 11, 46, 38, 0, 17,…
#> $ stats_commentCount         <int> 965, 75, 56, 27, 43, 2, 45, 16, 32, 10, 11…
#> $ stats_playCount            <int> 554100, 47200, 40100, 15200, 11500, 3386, …
#> $ authorStats_followingCount <int> 32, 32, 240, 240, 240, 24, 240, 240, 7790,…
#> $ authorStats_followerCount  <int> 259, 259, 1519, 1519, 1519, 2752, 1519, 15…
#> $ authorStats_heartCount     <int> 45500, 45500, 12000, 12000, 12000, 32100, …
#> $ authorStats_videoCount     <int> 5, 5, 69, 69, 69, 306, 69, 69, 490, 69, 69…
#> $ authorStats_diggCount      <int> 13800, 13800, 785, 785, 785, 8462, 785, 78…
#> $ authorStats_heart          <int> 45500, 45500, 12000, 12000, 12000, 32100, …
#> $ id                         <chr> "6895134839197011206", "689431916878766413…
#> $ desc                       <chr> "Where are all my fellow H0s at? No HAs al…
#> $ createTime                 <int> 1605398717, 1605208816, 1599615483, 159992…
#> $ challenges                 <chr> "list(id = \"1663286341999621\", title = \…
#> $ originalItem               <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE…
#> $ officalItem                <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE…
#> $ textExtra                  <chr> "list(awemeId = \"\", start = 47, end = 59…
#> $ secret                     <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE…
#> $ forFriend                  <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE…
#> $ digged                     <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE…
#> $ itemCommentStatus          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ showNotPass                <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE…
#> $ vl1                        <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE…
#> $ itemMute                   <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE…
#> $ privateItem                <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE…
#> $ duetEnabled                <chr> "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "T…
#> $ stitchEnabled              <chr> "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "T…
#> $ shareEnabled               <chr> "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "T…
#> $ isAd                       <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE…
#> $ effectStickers             <chr> "NULL", "NULL", "NULL", "NULL", "NULL", "N…

First, the data needs to be cleaned. Many variables that TikTok returns are not relevant, so we focus on the most important ones and clean the variable names using janitor::clean_names().

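stats_tk <- stats_tiktok %>% 
  select(id, createTime, 
         author_id:author_nickname, # author_id, author_uniqueId, author_nickname (used in later steps)
         author_signature, author_avatarLarger, 
         desc, music_id:authorStats_heart) %>% 
  janitor::clean_names() %>%        # camelCase -> snake_case
  distinct(id, .keep_all = TRUE)    # keep one row per video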

#> Rows: 123
#> Columns: 26
#> $ id                           <chr> "6895134839197011206", "6894319168787664…
#> $ create_time                  <int> 1605398717, 1605208816, 1599615483, 1599…
#> $ author_id                    <chr> "6815076321426326533", "6815076321426326…
#> $ author_unique_id             <chr> "chelllarson", "chelllarson", "chelseapa…
#> $ author_nickname              <chr> "Mitchell", "Mitchell", "Chelsea Parlett…
#> $ author_signature             <chr> "#statstiktok", "#statstiktok", "Just re…
#> $ author_avatar_larger         <chr> "…
#> $ desc                         <chr> "Where are all my fellow H0s at? No HAs …
#> $ music_id                     <chr> "6886831115148036869", "6890224563494472…
#> $ music_title                  <chr> "Ur appreciated ily", "original sound", …
#> $ music_play_url               <chr> "…
#> $ music_cover_thumb            <chr> "…
#> $ music_cover_medium           <chr> "…
#> $ music_cover_large            <chr> "…
#> $ music_author_name            <chr> "angela vasquez \U0001f49f", "carson", "…
#> $ music_original               <chr> "TRUE", "TRUE", "TRUE", "TRUE", "FALSE",…
#> $ stats_digg_count             <int> 42500, 2853, 2789, 1328, 1019, 638, 531,…
#> $ stats_share_count            <int> 6136, 480, 355, 10, 44, 11, 46, 38, 0, 1…
#> $ stats_comment_count          <int> 965, 75, 56, 27, 43, 2, 45, 16, 32, 10, …
#> $ stats_play_count             <int> 554100, 47200, 40100, 15200, 11500, 3386…
#> $ author_stats_following_count <int> 32, 32, 240, 240, 240, 24, 240, 240, 779…
#> $ author_stats_follower_count  <int> 259, 259, 1519, 1519, 1519, 2752, 1519, …
#> $ author_stats_heart_count     <int> 45500, 45500, 12000, 12000, 12000, 32100…
#> $ author_stats_video_count     <int> 5, 5, 69, 69, 69, 306, 69, 69, 490, 69, …
#> $ author_stats_digg_count      <int> 13800, 13800, 785, 785, 785, 8462, 785, …
#> $ author_stats_heart           <int> 45500, 45500, 12000, 12000, 12000, 32100…

Now we have a cleaned sample of #statstiktok!
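To make the renaming step more concrete, here is a minimal base R sketch of the camelCase to snake_case conversion. The helper to_snake is a simplified stand-in written for this example; janitor::clean_names() handles many more cases (spaces, special characters, duplicated names), so this is not its actual implementation.

```r
# Simplified stand-in for the renaming done by janitor::clean_names().
to_snake <- function(x) {
  # Insert an underscore between a lowercase letter or digit and an uppercase letter.
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)
  # Lowercase everything.
  tolower(x)
}

# Column names as returned in the raw data shown above.
raw_names <- c("createTime", "author_avatarLarger", "authorStats_followerCount")
clean <- to_snake(raw_names)
print(clean)
```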

Stats about tiktokers

Let's first find out who the top posters are in the data. Chelsea Parlett-Pelleriti is the most prolific poster when it comes to #statstiktok. This makes sense, considering that she is one of the pioneers of #statstiktok.

stats_tk %>% 
  count(author_unique_id, sort = TRUE) %>% 
  filter(n >= 2) %>% 
  mutate(authr_url = paste0("", author_unique_id)) %>% 
  knitr::kable()
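
| author_unique_id | n | authr_url |
|---|---|---|
| chelseaparlettpelleriti | 61 | |
| baboutunt | 5 | |
| dsquintana | 5 | |
| lakens | 4 | |
| epiellie | 3 | |
| morgane_fevrier | 3 | |
| statprof | 3 | |
| bookmatter | 2 | |
| chelllarson | 2 | |
| dataislife | 2 | |
| ladykelp | 2 | |
| rismyfavouriteletter | 2 | |
| ryansscience | 2 | |
| sam_d_parsons | 2 | |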

The metadata includes author data, such as the author's unique handle (author_unique_id), the text of their bio (author_signature), and some author stats (such as the follower count, author_stats_follower_count). In addition to classic social media metrics (number of accounts followed, posts, likes), TikTok also reports ‘diggs’. Unfortunately, there is no documentation on how this metric is computed. The author_stats_heart variable shows how many likes a user has received in total.

tiktokers <- stats_tk %>% 
  select(contains("author")) %>% 
  add_count(author_unique_id, name = "vids_in_sample") %>%  # videos per author in the sample
  distinct(author_id, .keep_all = TRUE)                     # one row per author

#> Rows: 39
#> Columns: 13
#> $ author_id                    <chr> "6815076321426326533", "6736543492652696…
#> $ author_unique_id             <chr> "chelllarson", "chelseaparlettpelleriti"…
#> $ author_nickname              <chr> "Mitchell", "Chelsea Parlett", "mr Akram…
#> $ author_signature             <chr> "#statstiktok", "Just really bad #statsT…
#> $ author_avatar_larger         <chr> "…
#> $ music_author_name            <chr> "angela vasquez \U0001f49f", "Rachel Mar…
#> $ author_stats_following_count <int> 32, 240, 24, 7790, 33, 3130, 2691, 42, 9…
#> $ author_stats_follower_count  <int> 259, 1519, 2752, 12100, 13100, 13700, 29…
#> $ author_stats_heart_count     <int> 45500, 12000, 32100, 122800, 119900, 195…
#> $ author_stats_video_count     <int> 5, 69, 306, 490, 124, 416, 108, 83, 178,…
#> $ author_stats_digg_count      <int> 13800, 785, 8462, 11800, 34000, 284, 149…
#> $ author_stats_heart           <int> 45500, 12000, 32100, 122800, 119900, 195…
#> $ vids_in_sample               <int> 2, 61, 1, 1, 1, 3, 1, 2, 1, 1, 4, 5, 1, …

Are any famous tiktokers using #statstiktok?

tiktokers %>% 
  select(author_unique_id, contains("count")) %>% 
  arrange(desc(author_stats_following_count)) %>% 
  rename_all(stringr::str_remove_all, "(author_stats_)") %>%
  arrange(desc(follower_count)) %>%
  slice(1:10) %>%
  mutate(authr_url = paste0("", author_unique_id)) %>% 
  knitr::kable()

| author_unique_id | following_count | follower_count | heart_count | video_count | digg_count | authr_url |
|---|---|---|---|---|---|---|
| morgane_fevrier | 3130 | 13700 | 195900 | 416 | 284 | |
| easyworkgaming | 33 | 13100 | 119900 | 124 | 34000 | |
| claredoe | 7790 | 12100 | 122800 | 490 | 11800 | |
| dimplisimply81 | 4 | 9677 | 150600 | 471 | 5296 | |
| office305 | 9 | 3562 | 35200 | 178 | 978 | |
| bellasunrae | 2691 | 2907 | 26300 | 108 | 14900 | |
| akrampathan4545 | 24 | 2752 | 32100 | 306 | 8462 | |
| chelseaparlettpelleriti | 240 | 1519 | 12000 | 69 | 785 | |
| ryansscience | 42 | 1464 | 39200 | 83 | 1298 | |
| mahdikhdiri | 24 | 1059 | 14100 | 753 | 1429 | |

Some of the tiktokers who used #statstiktok at least once seem to be pretty successful, with up to 13,700 followers. Maybe their one post mentioning #statstiktok is responsible for their success (…but most likely not :P). You can also see the strangeness of the digg count: it seems largely unrelated to the other metrics.
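One way to probe the impression that the digg count behaves differently is a rank correlation on the ten accounts listed in the table. The sketch below retypes three columns of that table by hand and uses base R's cor() with Spearman's method. Ten rows are far too few to conclude anything, so treat this as an illustration of the check rather than a result.

```r
# Values copied from the top-10 table (same row order).
top10 <- data.frame(
  follower_count = c(13700, 13100, 12100, 9677, 3562, 2907, 2752, 1519, 1464, 1059),
  heart_count    = c(195900, 119900, 122800, 150600, 35200, 26300, 32100, 12000, 39200, 14100),
  digg_count     = c(284, 34000, 11800, 5296, 978, 14900, 8462, 785, 1298, 1429)
)

# Spearman rank correlations between all pairs of columns.
rho <- cor(top10, method = "spearman")
print(round(rho, 2))
```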

Posts over time

When looking at frequency of posts over time, we see a continuous increase in the number of tiktoks mentioning #statstiktok since April 2020 (coinciding with the start of lockdowns in many countries!).

The function from_unix converts the timestamp in create_time to an actual datetime.

stats_tk %>% 
  mutate(create_date = from_unix(create_time) %>% lubridate::floor_date("day")) %>% 
  count(create_date) %>% 
  mutate(cumsum_n = cumsum(n)) %>% 
  ggplot(aes(create_date, cumsum_n)) + 
  geom_line() +
  theme_minimal() +
  scale_x_datetime(date_labels = "%B %Y")  +
  labs(y = "Number of Posts", x = "")
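For readers who want to see what the timestamp conversion and the cumulative count amount to, here is a base R sketch using three create_time values from the output shown earlier. It is a stand-in for from_unix, not its actual implementation, and it assumes UTC; the package function may treat time zones differently.

```r
# Three create_time values taken from the glimpse output above.
create_time <- c(1605398717, 1605208816, 1599615483)

# Unix timestamps count seconds since 1970-01-01 00:00:00 UTC.
created <- as.POSIXct(create_time, origin = "1970-01-01", tz = "UTC")

# Reduce each datetime to its calendar day (in UTC).
created_day <- as.Date(created, tz = "UTC")

# Count posts per day (table sorts the days), then accumulate over time.
per_day <- table(created_day)
cumulative <- cumsum(per_day)
print(cumulative)
```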

Check out the most played tiktoks

stats_tk %>% 
  arrange(desc(stats_play_count)) %>% 
  select(id, author_unique_id, stats_play_count) %>% 
  slice(1:10) %>%
  mutate(video_url = paste0("", author_unique_id, "/video/", id)) %>% 
  knitr::kable()

| id | author_unique_id | stats_play_count | video_url |
|---|---|---|---|
| 6895134839197011206 | chelllarson | 554100 | |
| 6894319168787664133 | chelllarson | 47200 | |
| 6870296141569871109 | chelseaparlettpelleriti | 40100 | |
| 6871608918494252294 | chelseaparlettpelleriti | 15200 | |
| 6854302480956673285 | chelseaparlettpelleriti | 12600 | |
| 6871246394531876102 | chelseaparlettpelleriti | 11500 | |
| 6818031944715062533 | chelseaparlettpelleriti | 3755 | |
| 6827002877089697025 | akrampathan4545 | 3386 | |
| 6854389075894455558 | chelseaparlettpelleriti | 2603 | |
| 6870863330244840709 | chelseaparlettpelleriti | 2487 | |

Chelsea makes a lot of appearances again!

TikTok descriptions

Each tiktok is usually accompanied by a brief text description. In this description, users typically add hashtags to increase the visibility of their posts.

Unsurprisingly, stats- and academia-related hashtags are often used in combination with #statstiktok. We can now use these new hashtags to explore further stats tiktoks.

stats_tiktok %>%
  select(desc) %>%
  mutate(hashtags = stringr::str_extract_all(desc, "#\\w+")) %>%  # list of hashtags per post
  tidyr::unnest(hashtags) %>%                                     # one row per hashtag
  mutate(hashtags = stringr::str_to_lower(hashtags)) %>% 
  count(hashtags, sort = TRUE) %>% 
  slice(1:10) %>%
  mutate(hashtag_url = paste0("", stringr::str_remove(hashtags, "#"))) %>% 
  knitr::kable()
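
| hashtags | n | hashtag_url |
|---|---|---|
| #statstiktok | 123 | |
| #statistics | 18 | |
| #fyp | 17 | |
| #phdlife | 15 | |
| #rstats | 13 | |
| #academia | 8 | |
| #datascience | 7 | |
| #science | 7 | |
| #bayesian | 6 | |
| #duet | 6 | |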
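The hashtag extraction can also be sketched in base R, which makes the individual steps (match, flatten, lowercase, count) easy to inspect. The three descriptions below are invented for the example and are not taken from the data.

```r
# Invented example descriptions (not from the scraped data).
desc <- c(
  "Where are my fellow H0s? #statsTikTok #statistics",
  "p-values again #statstiktok #rstats",
  "no hashtags here"
)

# Find every run of word characters that follows a '#'.
matches <- regmatches(desc, gregexpr("#\\w+", desc))

# Flatten to one vector and lowercase, so #statsTikTok and #statstiktok are merged.
hashtags <- tolower(unlist(matches))

# Count occurrences, most frequent first.
hashtag_counts <- sort(table(hashtags), decreasing = TRUE)
print(hashtag_counts)
```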

Expanding the data using hashtags

larger_stats_tiktok <- c("statsTikTok", "statistics", "rstats", "datascience") %>%
  purrr::map_dfr(~tk_posts("hashtag", .x, n = 2000)) %>%
  bind_rows(stats_tiktok) %>% # bind_rows the #statstiktok data
  distinct(id, .keep_all = TRUE)

Using the hashtags we discovered earlier, we can now get data for other hashtags (statsTikTok, statistics, rstats, datascience). We obtain metadata on 3474 tiktoks, which can be further analyzed or used for further expansion.
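The pattern used here (query several hashtags, stack the results, drop duplicated videos) can be tried without network access. In the sketch below, fake_posts is a hypothetical stand-in for a scraper call and returns made-up ids; only the combine-and-deduplicate logic is the point.

```r
# Hypothetical stand-in for a scraper call: returns a small data frame per hashtag.
fake_posts <- function(hashtag) {
  ids <- switch(hashtag,
    statistics = c("1", "2", "3"),
    rstats     = c("3", "4"),   # video "3" carries both hashtags
    character(0)
  )
  data.frame(id = ids, query = rep(hashtag, length(ids)), stringsAsFactors = FALSE)
}

queries <- c("statistics", "rstats")

# Run one query per hashtag and stack the results row-wise.
combined <- do.call(rbind, lapply(queries, fake_posts))

# Keep the first occurrence of each video id.
deduped <- combined[!duplicated(combined$id), ]
print(deduped)
```

Because the first occurrence is kept, a video found under several hashtags stays attributed to the query that returned it first.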

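larger_stats_tk <- larger_stats_tiktok %>% 
  select(id, createTime, 
         author_id:author_nickname, # author_id, author_uniqueId, author_nickname
         author_signature, author_avatarLarger, 
         desc, music_id:authorStats_heart) %>% 
  janitor::clean_names() %>%        # camelCase -> snake_case
  distinct(id, .keep_all = TRUE)    # keep one row per video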

#> Rows: 3,474
#> Columns: 26
#> $ id                           <chr> "6895134839197011206", "6894319168787664…
#> $ create_time                  <int> 1605398717, 1605208816, 1599615483, 1599…
#> $ author_id                    <chr> "6815076321426326533", "6815076321426326…
#> $ author_unique_id             <chr> "chelllarson", "chelllarson", "chelseapa…
#> $ author_nickname              <chr> "Mitchell", "Mitchell", "Chelsea Parlett…
#> $ author_signature             <chr> "#statstiktok", "#statstiktok", "Just re…
#> $ author_avatar_larger         <chr> "…
#> $ desc                         <chr> "Where are all my fellow H0s at? No HAs …
#> $ music_id                     <chr> "6886831115148036869", "6890224563494472…
#> $ music_title                  <chr> "Ur appreciated ily", "original sound", …
#> $ music_play_url               <chr> "…
#> $ music_cover_thumb            <chr> "…
#> $ music_cover_medium           <chr> "…
#> $ music_cover_large            <chr> "…
#> $ music_author_name            <chr> "angela vasquez \U0001f49f", "carson", "…
#> $ music_original               <chr> "TRUE", "TRUE", "TRUE", "TRUE", "FALSE",…
#> $ stats_digg_count             <int> 42600, 2869, 2789, 1328, 1019, 638, 531,…
#> $ stats_share_count            <int> 6152, 483, 355, 10, 44, 11, 46, 38, 0, 1…
#> $ stats_comment_count          <int> 965, 75, 56, 27, 43, 2, 45, 16, 32, 10, …
#> $ stats_play_count             <int> 555300, 47400, 40100, 15200, 11500, 3391…
#> $ author_stats_following_count <int> 33, 33, 240, 240, 240, 24, 240, 240, 778…
#> $ author_stats_follower_count  <int> 259, 259, 1519, 1519, 1519, 2752, 1519, …
#> $ author_stats_heart_count     <int> 45600, 45600, 12100, 12100, 12100, 32100…
#> $ author_stats_video_count     <int> 5, 5, 69, 69, 69, 306, 69, 69, 490, 69, …
#> $ author_stats_digg_count      <int> 13800, 13800, 788, 788, 788, 8462, 788, …
#> $ author_stats_heart           <int> 45600, 45600, 12100, 12100, 12100, 32100…

Before closing up, we take a look at the popularity of the queried hashtags. To do so, we keep only the posts that mention them and look at the distribution of plays by hashtag. In our small and not-at-all random sample, #statistics is associated with the most-viewed posts, while #rstats is (for now) the least popular hashtag.

larger_stats_tk %>%
  select(desc, matches("^stats")) %>%
  mutate(hashtags = stringr::str_extract_all(desc, "#\\w+")) %>%
  tidyr::unnest(hashtags) %>%
  filter(hashtags %in% c("#statsTikTok", "#statstiktok", "#statistics", "#rstats", "#datascience")) %>%
  mutate(hashtags = forcats::fct_reorder(hashtags, stats_play_count)) %>%
  ggplot(aes(x = hashtags, y = stats_play_count)) + 
  geom_boxplot() + 
  ylim(c(0, 12e3)) +
  theme_minimal() +
  labs(x = "Hashtag", y = "Number of Plays")
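The ordering that the boxplot conveys can also be checked numerically by computing the median number of plays per hashtag. The sketch below uses invented numbers purely to show the mechanics; it does not reproduce the values in the plot.

```r
# Invented play counts, one row per (post, hashtag) pair.
plays <- data.frame(
  hashtag = c("#statistics", "#statistics", "#rstats", "#rstats", "#datascience"),
  stats_play_count = c(12000, 800, 300, 500, 950),
  stringsAsFactors = FALSE
)

# Split the play counts by hashtag, take the median of each group, sort ascending.
median_plays <- sort(
  vapply(split(plays$stats_play_count, plays$hashtag), median, numeric(1))
)
print(median_plays)
```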

Check out the most common music titles used

larger_stats_tk %>% 
  count(music_title, music_play_url, sort = TRUE) %>% 
  filter(!stringr::str_detect(music_title, "original")) %>%  # drop "original sound" entries
  slice(1:10) %>%
  knitr::kable()

| music_title | music_play_url | n |
|---|---|---|
| Classical Music | | 14 |
| She Share Story (for Vlog) | | 14 |
| The Banjo Beat, Pt. 1 | | 12 |
| What If (I Told You I Like You) | | 10 |
| Banana (feat. Shaggy) [DJ FLe - Minisiren Remix] | | 8 |
| Renee | | 8 |
| Stranger | | 8 |
| Monkeys Spinning Monkeys | | 7 |
| Mystery of Love (From the Original Motion Picture “Call Me by Your Name”) | | 7 |
| Play Date | | 7 |

Looks like Classical Music is quite popular with stats tiktok!