
Finding Temporal Patterns in Twitter Posts: Exploratory Data Analysis with Python


Clustering of Twitter data with Python, K-Means, and t-SNE

Towards Data Science
Tweet clusters t-SNE visualization, Image by creator

In the article “What People Write about Climate”, I analyzed Twitter posts using natural language processing, vectorization, and clustering. With this approach, it is possible to find distinct groups in unstructured text data, for example, to extract messages about ice melting or about electric transport from thousands of tweets about climate. During the processing of this data, another question arose: what if we apply the same algorithm not to the messages themselves but to the time when those messages were published? This would allow us to analyze when and how often different people make posts on social media. It can be important not only from a sociological or psychological perspective but, as we will see later, also for detecting bots or users sending spam. Last but not least, almost everybody uses social platforms nowadays, and it is simply interesting to learn something new about ourselves. Obviously, the same algorithm can be applied not only to Twitter posts but to any media platform.

Methodology

I will use mostly the same approach as described in the first part about Twitter data analysis. Our data processing pipeline will consist of several steps:

  • Collecting tweets that include the specific hashtag and saving them in a CSV file. This was already done in the previous article, so I will skip the details here.
  • Finding the general properties of the collected data.
  • Calculating embedding vectors for each user based on the time of their posts.
  • Clustering the data using the K-Means algorithm.
  • Analyzing the results.

Let’s start.

1. Loading the data

I will be using the Tweepy library to collect Twitter posts. More details can be found in the first part; here I will only include the source code:

import tweepy

api_key = "YjKdgxk..."
api_key_secret = "Qa6ZnPs0vdp4X...."

auth = tweepy.OAuth2AppHandler(api_key, api_key_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

hashtag = "#climate"
language = "en"

def text_filter(s_data: str) -> str:
    """ Remove extra characters from the text """
    return s_data.replace("&", "and").replace(";", " ").replace(",", " ") \
                 .replace('"', " ").replace("\n", " ").replace("  ", " ")

def get_hashtags(tweet) -> str:
    """ Parse retweeted data """
    hash_tags = ""
    if 'hashtags' in tweet.entities:
        hash_tags = ','.join(map(lambda x: x["text"], tweet.entities['hashtags']))
    return hash_tags

def get_csv_header() -> str:
    """ CSV header """
    return "id;created_at;user_name;user_location;user_followers_count;user_friends_count;retweets_count;favorites_count;retweet_orig_id;retweet_orig_user;hash_tags;full_text"

def tweet_to_csv(tweet):
    """ Convert tweet data to a CSV string """
    if not hasattr(tweet, 'retweeted_status'):
        full_text = text_filter(tweet.full_text)
        hashtags = get_hashtags(tweet)
        retweet_orig_id = ""
        retweet_orig_user = ""
        favs, retweets = tweet.favorite_count, tweet.retweet_count
    else:
        retweet = tweet.retweeted_status
        retweet_orig_id = retweet.id
        retweet_orig_user = retweet.user.screen_name
        full_text = text_filter(retweet.full_text)
        hashtags = get_hashtags(retweet)
        favs, retweets = retweet.favorite_count, retweet.retweet_count
    # The location field is cleaned with the same text filter here to keep the snippet
    # self-contained; the first article used a separate address-filtering helper
    s_out = f"{tweet.id};{tweet.created_at};{tweet.user.screen_name};{text_filter(tweet.user.location)};{tweet.user.followers_count};{tweet.user.friends_count};{retweets};{favs};{retweet_orig_id};{retweet_orig_user};{hashtags};{full_text}"
    return s_out

if __name__ == "__main__":
    limit = 1000  # number of result pages to fetch; choose a value that suits your needs
    pages = tweepy.Cursor(api.search_tweets, q=hashtag, tweet_mode='extended',
                          result_type="recent",
                          count=100,
                          lang=language).pages(limit)

    with open("tweets.csv", "a", encoding="utf-8") as f_log:
        f_log.write(get_csv_header() + "\n")
        for ind, page in enumerate(pages):
            for tweet in page:
                # Get data per tweet
                str_line = tweet_to_csv(tweet)
                # Save to CSV
                f_log.write(str_line + "\n")

Using this code, we can get all Twitter posts with a specific hashtag made during the last 7 days. The hashtag is effectively our search query; we can find posts about climate, politics, or any other topic. Optionally, a language code allows us to search for posts in different languages. Readers are welcome to do extra research on their own; for example, it could be interesting to compare the results between English and Spanish tweets.
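For such a comparison, only the query parameters need to change. A minimal sketch of the variation (the hashtag “#clima” here is just an illustrative choice of mine, not from the original article):

# Illustrative variation: the same request with a Spanish hashtag and language filter
hashtag = "#clima"   # hypothetical hashtag for a Spanish-language dataset
language = "es"

pages = tweepy.Cursor(api.search_tweets, q=hashtag, tweet_mode='extended',
                      result_type="recent", count=100,
                      lang=language).pages(limit)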

After the CSV file is saved, let's load it into a dataframe, drop the unneeded columns, and see what kind of data we have:

import pandas as pd

df = pd.read_csv("climate.csv", sep=';',
                 dtype={'id': object, 'retweet_orig_id': object, 'full_text': str, 'hash_tags': str},
                 parse_dates=["created_at"], lineterminator='\n')
df.drop(["retweet_orig_id", "user_friends_count", "retweets_count", "favorites_count",
         "user_location", "hash_tags", "retweet_orig_user", "user_followers_count"],
        inplace=True, axis=1)
df = df.drop_duplicates('id')
with pd.option_context('display.max_colwidth', 80):
    display(df)

In the same way as in the first part, I collected Twitter posts with the hashtag “#climate”. The result looks like this:

We don't actually need the text or user id, but they can be useful for “debugging”, to see what the original tweet looks like. For further processing, we will need to know the day, time, and hour of each tweet. Let's add these columns to the dataframe:

import datetime

def get_time(dt: datetime.datetime):
    """ Get time from datetime """
    return dt.time()

def get_date(dt: datetime.datetime):
    """ Get date from datetime """
    return dt.date()

def get_hour(dt: datetime.datetime):
    """ Get hour from datetime """
    return dt.hour

df["date"] = df['created_at'].map(get_date)
df["time"] = df['created_at'].map(get_time)
df["hour"] = df['created_at'].map(get_hour)

We can easily verify the results:

display(df[["user_name", "date", "time", "hour"]])

Now we have all the needed information, and we are ready to go.

2. General Insights

As we can see from the last screenshot, 199,278 messages were loaded; these are messages with the “#Climate” hashtag, which I collected over several weeks. As a warm-up, let's answer a simple question: how many messages per day about climate were people posting on average?

First, let's calculate the total number of days and the total number of users:

days_total = df['date'].unique().shape[0]
print(days_total)
# > 46

users_total = df['user_name'].unique().shape[0]
print(users_total)
# > 79985

As we can see, the data was collected over 46 days, and in total, 79,985 Twitter users posted (or reposted) at least one message with the hashtag “#Climate” during that time. Obviously, we can only count users who made at least one post; alas, we cannot get the number of readers this way.
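As a quick answer to the warm-up question, a two-line calculation based on the numbers above (not code from the original article) gives roughly 4,300 “#Climate” tweets per day, or about 0.05 messages per user per day on average:

# Average number of messages per day, and per user per day
print(df.shape[0] / days_total)
print(df.shape[0] / (days_total * users_total))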

Let's find the number of messages per day for each user. First, let's group the dataframe by user name:

gr_messages_per_user = df.groupby(['user_name'], as_index=False).size().sort_values(by=['size'], ascending=False)
gr_messages_per_user["size_per_day"] = gr_messages_per_user['size'].div(days_total)

The “size” column gives us the number of messages each user sent. I also added the “size_per_day” column, which is easy to calculate by dividing the total number of messages by the total number of days. The result looks like this:

We can see that the most active users post up to 18 messages per day, while the least active users posted only one message within this 46-day period (1/46 ≈ 0.0217). Let's draw a histogram using NumPy and Bokeh:

import numpy as np
from bokeh.io import show, output_notebook, export_png
from bokeh.plotting import figure, output_file
from bokeh.models import ColumnDataSource, LabelSet, Whisker
from bokeh.transform import factor_cmap, factor_mark, cumsum
from bokeh.palettes import *
output_notebook()

users = gr_messages_per_user['user_name']
amount = gr_messages_per_user['size_per_day']
hist_e, edges_e = np.histogram(amount, density=False, bins=100)

# Draw
p = figure(width=1600, height=500, title="Messages per day distribution")
p.quad(top=hist_e, bottom=0, left=edges_e[:-1], right=edges_e[1:], line_color="darkblue")
p.x_range.start = 0
# p.x_range.end = 150000
p.y_range.start = 0
p.xaxis[0].ticker.desired_num_ticks = 20
p.left[0].formatter.use_scientific = False
p.below[0].formatter.use_scientific = False
p.xaxis.axis_label = "Messages per day, avg"
p.yaxis.axis_label = "Amount of users"
show(p)

The output looks like this:

Messages per day distribution, Image by creator

Interestingly, we can see only one bar. Of all 79,985 users who posted messages with the “#Climate” hashtag, almost all of them (77,275 users) sent, on average, less than one message per day. It looks surprising at first glance, but really, how often do we post tweets about the climate? Honestly, I never did it in my life. We need to zoom the graph in a lot to see the other bars on the histogram:
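One simple way to get such a zoomed view (the y-axis limit of 100 is just an arbitrary value I chose for illustration) is to cap the y-axis of the same histogram and redraw it:

# Cap the y-axis so that the small bars become visible (100 is an arbitrary limit)
p.y_range.end = 100
show(p)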

Messages per day distribution with a higher zoom level, Image by creator

Only at this zoom level can we see that among all 79,985 Twitter users who posted something about “#Climate”, there are fewer than 100 “activists” posting messages every day! Okay, maybe “climate” is not something people post about daily, but is it the same with other topics? I created a helper function that returns the percentage of “active” users who posted more than N messages per day:

def get_active_users_percent(df_in: pd.DataFrame, messages_per_day_threshold: int):
    """ Get the percentage of active users with a messages-per-day threshold """
    days_total = df_in['date'].unique().shape[0]
    users_total = df_in['user_name'].unique().shape[0]
    gr_messages_per_user = df_in.groupby(['user_name'], as_index=False).size()
    gr_messages_per_user["size_per_day"] = gr_messages_per_user['size'].div(days_total)
    users_active = gr_messages_per_user[gr_messages_per_user['size_per_day'] >= messages_per_day_threshold].shape[0]
    return 100*users_active/users_total

Then, using the same Tweepy code, I downloaded dataframes for six hashtags from different domains. We can draw the results with Bokeh:

labels = ['#Climate', '#Politics', '#Cats', '#Humour', '#Space', '#War']
counts = [get_active_users_percent(df_climate, messages_per_day_threshold=1),
          get_active_users_percent(df_politics, messages_per_day_threshold=1),
          get_active_users_percent(df_cats, messages_per_day_threshold=1),
          get_active_users_percent(df_humour, messages_per_day_threshold=1),
          get_active_users_percent(df_space, messages_per_day_threshold=1),
          get_active_users_percent(df_war, messages_per_day_threshold=1)]

palette = Spectral6
source = ColumnDataSource(data=dict(labels=labels, counts=counts, color=palette))
p = figure(width=1200, height=400, x_range=labels, y_range=(0,9),
           title="Percentage of Twitter users posting 1 or more messages per day",
           toolbar_location=None, tools="")
p.vbar(x='labels', top='counts', width=0.9, color='color', source=source)
p.xgrid.grid_line_color = None
p.y_range.start = 0
show(p)

The results are interesting:

Percentage of active users who posted at least 1 message per day with a specific hashtag, Image by creator

The most popular hashtag here is “#Cats”. In this group, about 6.6% of users make posts daily. Are their cats just so cute that their owners cannot resist the temptation? On the contrary, “#Humour” is a popular topic with a large number of messages, but the number of people who post more than one message per day is minimal. On more serious topics like “#War” or “#Politics”, about 1.5% of users make posts daily. And surprisingly, many more people make daily posts about “#Space” compared to “#Humour”.

To clarify these figures in more detail, let's find the distribution of the number of messages per user; it is not directly related to message time, but it is still interesting to find the answer:

def get_cumulative_percents_distribution(df_in: pd.DataFrame, steps=200):
    """ Get the distribution of the total percentage of messages sent by a percentage of users """
    # Group the dataframe by user name and sort by the number of messages
    df_messages_per_user = df_in.groupby(['user_name'], as_index=False).size().sort_values(by=['size'], ascending=False)
    users_total = df_messages_per_user.shape[0]
    messages_total = df_messages_per_user["size"].sum()

    # Get the cumulative messages/users ratio
    messages = []
    percentage = np.arange(0, 100, 0.05)
    for perc in percentage:
        msg_count = df_messages_per_user[:int(perc*users_total/100)]["size"].sum()
        messages.append(100*msg_count/messages_total)

    return percentage, messages

This method calculates the total number of messages posted by the most active users. Because the absolute numbers can vary strongly between topics, I use percentages for both outputs. With this function, we can compare results for different hashtags:

# Calculate
percentage, messages1 = get_cumulative_percents_distribution(df_climate)
_, messages2 = get_cumulative_percents_distribution(df_politics)
_, messages3 = get_cumulative_percents_distribution(df_cats)
_, messages4 = get_cumulative_percents_distribution(df_humour)
_, messages5 = get_cumulative_percents_distribution(df_space)
_, messages6 = get_cumulative_percents_distribution(df_war)

labels = ['#Climate', '#Politics', '#Cats', '#Humour', '#Space', '#War']
messages = [messages1, messages2, messages3, messages4, messages5, messages6]

# Draw
palette = Spectral6
p = figure(width=1200, height=400,
           title="Twitter messages per user percentage ratio",
           x_axis_label='Percentage of users',
           y_axis_label='Percentage of messages')
for ind in range(6):
    p.line(percentage, messages[ind], line_width=2, color=palette[ind], legend_label=labels[ind])
p.x_range.end = 100
p.y_range.start = 0
p.y_range.end = 100
p.xaxis.ticker.desired_num_ticks = 10
p.legend.location = 'bottom_right'
p.toolbar_location = None
show(p)

Because both axes are “normalized” to 0..100%, it is easy to compare results for different topics:

Distribution of messages made by the most active users, Image by creator

Again, the result looks interesting. We can see that the distribution is strongly skewed: 10% of the most active users post 50–60% of the messages (spoiler alert: as we will see soon, not all of them are humans ;).

This graph was made by a function of only about 20 lines of code. This “analysis” is pretty simple, but many additional questions can arise. There is a distinct difference between topics, and finding the answer to why this is so is clearly not straightforward. Which topics have the largest number of active users? Are there cultural or regional differences, and is the curve the same in different countries, like the US, Russia, or Japan? I encourage readers to do some tests on their own.

Now that we have some basic insights, it's time to do something more challenging. Let's cluster all users and try to find some common patterns. To do this, we first need to convert each user's data into an embedding vector.

3. Making User Embeddings

An embedding vector is a list of numbers that represents the data for each user. In the previous article, I built embedding vectors from tweet words and sentences. Now, because I want to find patterns in the “temporal” domain, I will calculate embeddings based on the message time. But first, let's see what the data looks like.

As a reminder, we have a dataframe with all tweets collected for a specific hashtag. Each tweet has a user name, creation date, time, and hour:

Let's create a helper function to show all tweet times for a specific user:

def draw_user_timeline(df_in: pd.DataFrame, user_name: str):
    """ Draw the cumulative messages timeline for a specific user """
    df_u = df_in[df_in["user_name"] == user_name]
    days_total = df_u['date'].unique().shape[0]

    # Group messages by time of the day
    messages_per_day = df_u.groupby(['time'], as_index=False).size()
    msg_time = messages_per_day['time']
    msg_count = messages_per_day['size']

    # Draw
    p = figure(x_axis_type='datetime', width=1600, height=150,
               title=f"Cumulative tweets timeline during {days_total} days: {user_name}")
    p.vbar(x=msg_time, top=msg_count, width=datetime.timedelta(seconds=30), line_color='black')
    p.xaxis[0].ticker.desired_num_ticks = 30
    p.xgrid.grid_line_color = None
    p.toolbar_location = None
    p.x_range.start = datetime.time(0,0,0)
    p.x_range.end = datetime.time(23,59,0)
    p.y_range.start = 0
    p.y_range.end = 1
    show(p)

draw_user_timeline(df, user_name="UserNameHere")
...

The result looks like this:

Messages timeline for several users, Image by creator

Here we can see messages made by several users over several weeks, displayed on the 00–24h timeline. We may already spot some patterns here, but as it turned out, there is one problem. The Twitter API does not return a time zone. There is a “timezone” field in the message body, but it is always empty. Maybe when we see tweets in the browser, we see them in our local time; in that case, the original timezone is simply not important. Or maybe it is a limitation of the free account. Either way, we cannot cluster the data properly if one user from the US starts sending messages at 2 AM UTC and another user from India starts sending messages at 1 PM UTC; the two timelines just will not match.

As a workaround, I tried to “estimate” the timezone myself using a simple empirical rule: most people are sleeping at night and are highly unlikely to post tweets during that time 😉 So, we can find the 9-hour interval where the average number of messages is minimal and assume that this is the “night” time for that user.

from typing import List

def get_night_offset(hours: List):
    """ Estimate the night position by calculating the rolling average minimum """
    night_len = 9
    min_pos, min_avg = 0, 99999
    # Find the minimum position
    data = np.array(hours + hours)
    for p in range(24):
        avg = np.average(data[p:p + night_len])
        if avg <= min_avg:
            min_avg = avg
            min_pos = p

    # Move the position to the right if possible (in case of a long sequence of similar numbers)
    for p in range(min_pos, len(data) - night_len):
        avg = np.average(data[p:p + night_len])
        if avg <= min_avg:
            min_avg = avg
            min_pos = p
        else:
            break

    return min_pos % 24

def normalize(hours: List):
    """ Shift the hours array to the right, keeping the 'night' time on the left """
    offset = get_night_offset(hours)
    data = hours + hours
    return data[offset:offset+24]

In practice, it works well in cases like this one, where the “night” period can be easily detected:

Of course, some people wake up at 7 AM and some at 10 AM, and without a time zone, we cannot tell the difference. Anyway, it is better than nothing, and as a “baseline”, this algorithm can be used.

Obviously, the algorithm does not work in cases like this one:

In this example, we just don't know whether this user was posting messages in the morning, in the evening, or after lunch; there is no information about that. But it is still interesting to see that some users post messages only at a specific time of the day. In this case, having a “virtual offset” is still helpful; it allows us to “align” all user timelines, as we will see soon in the results.

Now let's calculate the embedding vectors. There are different ways of doing this. I decided to use vectors in the form [SumTotal, Sum00, ..., Sum23], where SumTotal is the total number of messages made by a user, and Sum00..Sum23 are the numbers of messages made during each hour of the day. We can use Pandas' groupby method with the two parameters “user_name” and “hour”, which does almost all the needed calculations for us:

def get_vectorized_users(df_in: pd.DataFrame):
    """ Get embedding vectors for all users
        Embedding format: [total messages, messages per hour 00, 01, ..., 23]
    """
    gr_messages_per_user = df_in.groupby(['user_name', 'hour'], as_index=True).size()

    vectors = []
    users = gr_messages_per_user.index.get_level_values('user_name').unique().values
    for ind, user in enumerate(users):
        if ind % 10000 == 0:
            print(f"Processing {ind} of {users.shape[0]}")
        hours_all = [0]*24
        for hr, value in gr_messages_per_user[user].items():
            hours_all[hr] = value

        hours_norm = normalize(hours_all)
        vectors.append([sum(hours_norm)] + hours_norm)

    return users, np.asarray(vectors)

all_users, vectorized_users = get_vectorized_users(df)

Here, the “get_vectorized_users” method does the calculation. After calculating each 00..24h vector, I use the “normalize” function to apply the “timezone” offset, as described before.

In practice, the embedding vector for a relatively active user may look like this:

[120 0 0 0 0 0 0 0 0 0 1 2 0 2 2 1 0 0 0 0 0 18 44 50 0]

Here, 120 is the total number of messages, and the rest is a 24-element array with the number of messages made within each hour (as a reminder, in our case, the data was collected over 46 days). For an inactive user, the embedding may look like this:

[4 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0]

Different embedding vectors can also be created, and a more complex scheme may provide better results. For example, it could be interesting to add the total number of “active” hours per day or to include the day of the week in the vector to see how a user's activity varies between working days and weekends, and so on.
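As a sketch of such an extension (my own assumption, not the author's implementation), the per-weekday counts could be appended to the hourly vector, turning it into [total, hour-00..hour-23, weekday-0..weekday-6]. The hypothetical helper below reuses the “normalize” function defined earlier:

def get_vectorized_users_extended(df_in: pd.DataFrame):
    """ Embedding format: [total messages, hour-00..hour-23, weekday-0..weekday-6] """
    df_tmp = df_in.copy()
    df_tmp["weekday"] = df_tmp['created_at'].dt.dayofweek
    gr_hours = df_tmp.groupby(['user_name', 'hour']).size()
    gr_days = df_tmp.groupby(['user_name', 'weekday']).size()

    users = df_tmp['user_name'].unique()
    vectors = []
    for user in users:
        hours_all = [0]*24
        for hr, value in gr_hours[user].items():
            hours_all[hr] = value
        days_all = [0]*7
        for day, value in gr_days[user].items():
            days_all[day] = value
        # Apply the same "night" normalization to the hourly part only
        vectors.append([sum(hours_all)] + normalize(hours_all) + days_all)
    return users, np.asarray(vectors)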

4. Clustering

As in the previous article, I will be using the K-Means algorithm to find the clusters. First, let's find the optimal K value using the Elbow method:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline

def draw_elbow_graph(x: np.array, k1: int, k2: int, k3: int):
    k_values, inertia_values = [], []
    for k in range(k1, k2, k3):
        print("Processing:", k)
        km = KMeans(n_clusters=k).fit(x)
        k_values.append(k)
        inertia_values.append(km.inertia_)

    plt.figure(figsize=(12,4))
    plt.plot(k_values, inertia_values, 'o')
    plt.title('Inertia for each K')
    plt.xlabel('K')
    plt.ylabel('Inertia')

draw_elbow_graph(vectorized_users, 2, 20, 1)

The result looks like this:

The Elbow graph for user embeddings, Image by creator

Let's write a method to calculate the clusters and draw the timelines for some users:

from sklearn.metrics import silhouette_score, silhouette_samples

def get_clusters_kmeans(x, k):
    """ Get clusters using K-Means """
    km = KMeans(n_clusters=k).fit(x)
    s_score = silhouette_score(x, km.labels_)
    print(f"K={k}: Silhouette coefficient {s_score:0.2f}, inertia:{km.inertia_}")

    sample_silhouette_values = silhouette_samples(x, km.labels_)
    silhouette_values = []
    for i in range(k):
        cluster_values = sample_silhouette_values[km.labels_ == i]
        silhouette_values.append((i, cluster_values.shape[0], cluster_values.mean(),
                                  cluster_values.min(), cluster_values.max()))
    silhouette_values = sorted(silhouette_values, key=lambda tup: tup[2], reverse=True)

    for s in silhouette_values:
        print(f"Cluster {s[0]}: Size:{s[1]}, avg:{s[2]:.2f}, min:{s[3]:.2f}, max: {s[4]:.2f}")
    print()

    # Create a new dataframe with user names, embedding vectors, and cluster labels
    cdf = pd.DataFrame({
        "id": all_users,
        "vector": [str(v) for v in vectorized_users],
        "cluster": km.labels_,
    })

    # Show top clusters
    for cl in silhouette_values[:10]:
        df_c = cdf[cdf['cluster'] == cl[0]]
        # Show the cluster
        print("Cluster:", cl[0], cl[2])
        with pd.option_context('display.max_colwidth', None):
            display(df_c[["id", "vector"]][:20])
        # Show the first users
        for user in df_c["id"].values[:10]:
            draw_user_timeline(df, user_name=user)
        print()

    return km.labels_

clusters = get_clusters_kmeans(vectorized_users, k=5)

This method is mostly the same as in the previous part; the only difference is that I draw user timelines for each cluster instead of a word cloud.

5. Results

Finally, we are ready to see the results. Obviously, not all groups were perfectly separated, but some of the categories are interesting to mention. As a reminder, I was analyzing all tweets of users who made posts with the “#Climate” hashtag within 46 days. So, what clusters can we see in posts about climate?

“Inactive” users, who sent only 1–2 messages within a month. This group is the largest; as discussed above, it represents more than 95% of all users. And the K-Means algorithm was able to detect this cluster as the largest one. Timelines for these users look like this:

“Interested” users. These users posted tweets every 2–5 days, so I can assume that they have at least some kind of interest in this topic.

“Active” users. These users post more than several messages per day:

We don't know whether these people are just “activists” or whether they regularly post tweets as part of their job, but at least we can see that their online activity is pretty high.

“Bots”. These users are highly unlikely to be humans at all. Not surprisingly, they have the highest number of posted messages. Of course, I have no 100% proof that all these accounts belong to bots, but it is unlikely that any human can post messages so regularly without rest or sleep:

The second “user”, for example, posts tweets at the same time of day with 1-second accuracy; its tweets could be used as an NTP server 🙂

By the way, some other “users” are not really active, but their datetime pattern looks suspicious. This “user” does not have many messages, and there is a visible “day/night” pattern, so it was not clustered as a “bot”. But to me, it looks unrealistic that an ordinary user would publish messages strictly at the beginning of every hour:

Maybe the auto-correlation function could give good results in detecting all users with suspiciously repetitive activity.
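As a sketch of that idea (an assumption on my side, not something done in the article), we could compute the autocorrelation of a user's per-minute message counts; a strong peak at a fixed lag, such as exactly 60 minutes, would hint at scripted posting:

def get_minute_counts(df_in: pd.DataFrame, user_name: str) -> np.ndarray:
    """ Number of messages per minute of the day (1440 bins) for one user """
    df_u = df_in[df_in["user_name"] == user_name]
    minutes = df_u['created_at'].dt.hour*60 + df_u['created_at'].dt.minute
    counts = np.zeros(24*60)
    for minute, size in minutes.value_counts().items():
        counts[minute] = size
    return counts

def autocorrelation(series: np.ndarray, lag: int) -> float:
    """ Normalized circular autocorrelation of the series at a given lag """
    s = series - series.mean()
    return float(np.dot(s, np.roll(s, lag)) / (np.dot(s, s) + 1e-9))

counts = get_minute_counts(df, user_name="UserNameHere")  # hypothetical user name
print(autocorrelation(counts, lag=60))  # values close to 1.0 suggest hourly repetition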

“Clones”. If we run the K-Means algorithm with higher values of K, we can also detect some “clones”. These clusters have similar time patterns and the highest silhouette values. For example, we can see several accounts with similar-looking nicknames that differ only in the last characters. Probably, a script is posting messages from several accounts in parallel:

As a final step, we can look at a cluster visualization made with the t-SNE (t-distributed Stochastic Neighbor Embedding) algorithm, which looks pretty beautiful:
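A minimal sketch of how such a projection could be produced with scikit-learn and Bokeh (the perplexity value and the plot styling are assumptions of mine, not the author's original code):

from sklearn.manifold import TSNE

# Project the 25-dimensional user embeddings to 2D
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
points_2d = tsne.fit_transform(vectorized_users)

# Color each point by its K-Means cluster label
palette = Spectral6
colors = [palette[label % len(palette)] for label in clusters]

p = figure(width=800, height=800, title="t-SNE projection of user embeddings")
p.scatter(x=points_2d[:, 0], y=points_2d[:, 1], size=3, color=colors, alpha=0.5)
show(p)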

Here we can see numerous smaller clusters that were not detected by K-Means with K=5. In this case, it makes sense to try higher K values; maybe another algorithm like DBSCAN (Density-Based Spatial Clustering of Applications with Noise) would also provide good results.
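For completeness, a possible DBSCAN variant could look like this (a rough sketch; the eps and min_samples values are guesses that would need tuning on the real data):

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Scale the embeddings, then let DBSCAN find dense groups; the label -1 means "noise"
x_scaled = StandardScaler().fit_transform(vectorized_users)
db = DBSCAN(eps=1.5, min_samples=10).fit(x_scaled)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(f"DBSCAN found {n_clusters} clusters and {(db.labels_ == -1).sum()} noise points")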

Conclusion

Using data clustering, we were able to find distinctive patterns in tens of thousands of tweets about “#Climate” made by different users. The analysis itself was done using only the time of the tweet posts. This can be useful in sociology or cultural anthropology studies; for example, we can compare the online activity of different users on different topics, figure out how often they make social network posts, and so on. Time analysis is language-agnostic, so it is also possible to compare results from different geographical areas, for example, online activity between English- and Japanese-speaking users. Time-based data can also be useful in psychology or medicine; for example, it is possible to figure out how many hours people spend on social networks or how often they take pauses. And as demonstrated above, finding patterns in users' “behavior” can be useful not only for research purposes but also for purely “practical” tasks like detecting bots, “clones”, or users posting spam.

Alas, not all of the analysis was successful, because the Twitter API does not provide timezone data. For example, it would be interesting to see whether people post more messages in the morning or in the evening, but without the proper local time, it is impossible; all messages returned by the Twitter API are in UTC. But anyway, it is great that the Twitter API allows us to get large amounts of data even with a free account. And obviously, the ideas described in this post can be used not only for Twitter but for other social networks as well.

If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors.

Thanks for reading.
