Clean, process and tokenise texts in milliseconds using built-in Polars string expressions
With the massive-scale adoption of Large Language Models (LLMs), it may appear that we are past the stage of manually cleaning and processing text data. Unfortunately, I and other NLP practitioners can attest that this is very much not the case. Clean text data is required at every level of NLP complexity, from basic text analytics to machine learning and LLMs. This post will showcase how this laborious and tedious process can be significantly sped up using Polars.
Polars is a blazingly fast DataFrame library written in Rust that is incredibly efficient at handling strings (thanks to its Arrow backend). Polars stores strings in the Utf8 format using the Arrow backend, which makes string traversal cache-optimal and predictable. It also exposes plenty of built-in string operations under the str namespace, which means the string operations are parallelised. Both of these aspects make working with strings extremely easy and fast.
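To get a feel for these expressions before we touch the email data, here is a minimal toy example (not from the dataset; the strings below are made up purely for illustration):

import polars as pl

# A toy DataFrame, just to demonstrate the str namespace
df = pl.DataFrame({"text": ["Hello, WORLD!!", "Polars handles strings FAST"]})

df = df.with_columns(
    # Lowercase the text and drop everything that is not a letter or whitespace
    pl.col("text").str.to_lowercase().str.replace_all(r"[^a-z\s]", "").alias("clean")
)
print(df)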
The library shares a lot of syntax with Pandas, but there are also plenty of quirks that you'll have to get used to. This post will walk you through working with strings, but for a comprehensive overview I highly recommend the "Getting Started" guide, as it will give you a good overview of the library.
You can find all of the code in this GitHub repo, so make sure to pull it if you want to code along (and don't forget to ⭐ it). To make this post more practical and fun, I'll showcase how we can clean a small scam email dataset, which can be found on Kaggle (License CC BY-SA 4.0). Polars can be installed using pip (pip install polars) and the recommended Python version is 3.10.
The goal of this pipeline is to parse the raw text file into a DataFrame that can be used for further analytics/modelling. Here are the general steps that will be implemented:
- Read in text data
- Extract relevant fields (e.g. sender email, subject, text, etc.)
- Extract useful features from these fields (e.g. length, % of digits, etc)
- Pre-process text for further analysis
- Perform some basic text analytics
Without further ado, let’s begin!
Reading Data
Assuming that the text file with emails is saved as fraudulent_emails.txt, here's the function used to read them in:
def load_emails_txt(path: str, split_str: str = "From r ") -> list[str]:
with open(path, "r", encoding="utf-8", errors="ignore") as file:
        text = file.read()

    emails = text.split(split_str)
return emails
If you explore the text data you'll see that the emails have two main sections:
- Metadata (starts with From r) that contains the email sender, subject, etc.
- Email text (starts after Status: O or Status: RO)
I'm using the first pattern to split the continuous text file into a list of emails. Overall, we should be able to read in 3977 emails that we put into a Polars DataFrame for further analysis.
import polars as pl

emails = load_emails_txt("fradulent_emails.txt")
emails_pl = pl.DataFrame({"emails": emails})

print(len(emails))
>>> 3977
Extracting Relevant Fields
Now the tricky part begins. How do we extract relevant fields from this mess of text data? Unfortunately, the answer is regex.
Sender and Subject
Upon further inspection of the metadata (below) you can see that it has the From: and Subject: fields, which are going to be very useful for us.
From r Wed Oct 30 21:41:56 2002
Return-Path:
X-Sieve: cmu-sieve 2.0
Return-Path:
Message-Id: <200210310241.g9V2fNm6028281@cs.CU>
From: "MR. JAMES NGOLA."
Reply-To: james_ngola2002@maktoob.com
To: webmaster@aclweb.org
Date: Thu, 31 Oct 2002 02:38:20 +0000
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
Status: O
If you keep scrolling through the emails, you'll find that there are a few formats for the From: field. The first format is the one you see above, where we have both a name and an email. The second format contains only the email, e.g. From: 123@abc.com or From: "123@abc.com". With this in mind, we'll need three patterns: one for the subject, and two for the sender (name with email, and email only).
email_pattern = r"From:s*([^subject_pattern = r"Subject:s*(.*)"
name_email_pattern = r'From:s*"?([^"<]+)"?s*<([^>]+)>'
Polars has a str.extract method that can match the above patterns against our text and (you guessed it) extract the matching groups. Here's how you can apply it to the emails_pl DataFrame.
emails_pl = emails_pl.with_columns(
    # Extract the first match group as the sender name
pl.col("emails").str.extract(name_email_pattern, 1).alias("sender_name"),
# Extract the second match group as email
pl.col("emails").str.extract(name_email_pattern, 2).alias("sender_email"),
    # Extract the subject
pl.col("emails").str.extract(subject_pattern, 1).alias("subject"),
).with_columns(
    # In cases where we didn't extract an email
pl.when(pl.col("sender_email").is_null())
    # Try the other pattern (email only)
.then(pl.col("emails").str.extract(email_pattern, 1))
# If we do have an email, do nothing
.otherwise(pl.col("sender_email"))
.alias("sender_email")
)
As you can see, besides str.extract we're also using a pl.when().then().otherwise() expression (the Polars version of if/else) to account for the second, email-only pattern. If you print out the results, you'll see that in most cases it worked correctly (and incredibly fast). We now have sender_name, sender_email and subject fields for our analysis.
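Before moving on, it's worth eyeballing a few of the extracted values. A quick check like the one below (purely illustrative; the exact output depends on your data) does the trick:

# Peek at the extracted metadata fields
print(emails_pl.select(["sender_name", "sender_email", "subject"]).head())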
Email Text
As was noted above, the actual email text starts after Status: O (opened) or Status: RO (read and opened), which means that we can use this pattern to split the email into "metadata" and "text" parts. Below you can see the three steps that we need to take to extract the required field, and the corresponding Polars methods to perform them.
- Replace Status: RO with Status: O so that we only have one "split" pattern (use str.replace)
- Split the whole string by Status: O (use str.split)
- Get the second element (the text) of the resulting list (use arr.get(1))
emails_pl = emails_pl.with_columns(
# Apply operations to the emails column
pl.col("emails")
    # Make these two statuses the same
.str.replace("Status: RO", "Status: O", literal=True)
# Split using the status string
.str.split("Status: O")
# Get the second element
.arr.get(1)
    # Rename the field
.alias("email_text")
)
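To confirm that the split worked as intended, you can peek at the start of the first extracted email body (an illustrative check, not part of the original pipeline):

# Peek at the start of the first extracted email body
first_email = emails_pl["email_text"][0]
print(first_email[:500] if first_email is not None else "No text extracted")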
Et voilà! We have extracted the required fields in just a few milliseconds. Let's put it all into one coherent function that we can later use in the pipeline.
def extract_fields(emails: pl.DataFrame) -> pl.DataFrame:
email_pattern = r"From:s*([^subject_pattern = r"Subject:s*(.*)"
name_email_pattern = r'From:s*"?([^"<]+)"?s*<([^>]+)>'emails = (
emails.with_columns(
pl.col("emails").str.extract(name_email_pattern, 2).alias("sender_email"),
pl.col("emails").str.extract(name_email_pattern, 1).alias("sender_name"),
pl.col("emails").str.extract(subject_pattern, 1).alias("subject"),
)
.with_columns(
pl.when(pl.col("sender_email").is_null())
.then(pl.col("emails").str.extract(email_pattern, 1))
.otherwise(pl.col("sender_email"))
.alias("sender_email")
)
.with_columns(
pl.col("emails")
.str.replace("Status: RO", "Status: O", literal=True)
.str.split("Status: O")
.arr.get(1)
.alias("email_text")
)
)
return emails
Now we can move on to the feature generation part.
Feature Engineering
From personal experience, scam emails tend to be very detailed and long (since scammers are trying to win your trust), so the character length of an email is going to be quite informative. Also, they heavily use exclamation marks and digits, so calculating the proportion of non-alphabetic characters in an email can also be useful. Finally, scammers love to use caps lock, so let's calculate the proportion of capital letters as well. There are, of course, many more features we could create, but to keep this post from getting too long, let's just focus on these three.
The first feature can be created very easily using the built-in str.n_chars() function. The other two features can be computed using regex and str.count_match(). Below you will find the function to calculate all three features. Similar to the previous function, it uses the with_columns() clause to carry over the old features and create the new ones on top of them.
def email_features(data: pl.DataFrame, col: str) -> pl.DataFrame:
data = data.with_columns(
pl.col(col).str.n_chars().alias(f"{col}_length"),
).with_columns(
(pl.col(col).str.count_match(r"[A-Z]") / pl.col(f"{col}_length")).alias(
f"{col}_percent_capital"
),
(pl.col(col).str.count_match(r"[^A-Za-z ]") / pl.col(f"{col}_length")).alias(
f"{col}_percent_digits"
),
    )

    return data
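As a quick, hypothetical illustration of what this function produces, you could compute the features for the email text column and summarise them:

# Illustrative usage: add the three features for the email text column
emails_pl = email_features(emails_pl, "email_text")

print(
    emails_pl.select(
        ["email_text_length", "email_text_percent_capital", "email_text_percent_digits"]
    ).describe()
)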
Text Cleaning
If you print out a few of the emails we've extracted, you'll notice some things that need to be cleaned up. For example:
- HTML tags are still present in some of the emails
- Lots of non-alphabetic characters are used
- Some emails are written in uppercase, some in lowercase, and some are mixed
Same as above, we're going to use regular expressions to clean up the data. However, now the method of choice is str.replace_all because we want to replace all of the matched instances, not just the first one. Additionally, we'll use str.to_lowercase() to make all of the text lowercase.
emails_pl = emails_pl.with_columns(
# Apply operations to the emails text column
pl.col("email_text")
    # Remove everything between <..> (HTML tags)
    .str.replace_all(r"<.*?>", "")
    # Replace non-alphabetic characters (except whitespace) in the text
    .str.replace_all(r"[^a-zA-Z\s]+", " ")
    # Replace multiple whitespaces with a single whitespace
    # We need to do this because of the previous cleaning step
    .str.replace_all(r"\s+", " ")
# Make all text lowercase
.str.to_lowercase()
    # Keep the field's name
.keep_name()
)
Now, let's refactor this chain of operations into a function so that it can be applied to the other columns of interest as well.
def email_clean(
data: pl.DataFrame, col: str, new_col_name: str | None = None
) -> pl.DataFrame:
data = data.with_columns(
pl.col(col)
.str.replace_all(r"<.*?>", " ")
.str.replace_all(r"[^a-zA-Zs]+", " ")
.str.replace_all(r"s+", " ")
.str.to_lowercase()
.alias(new_col_name if new_col_name is just not None else col)
)return data
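The optional new_col_name argument lets you keep the raw column intact. For example (a hypothetical call; the new column name is just for illustration):

# Illustrative usage: clean the subject into a separate column, keeping the original
emails_pl = email_clean(emails_pl, "subject", new_col_name="subject_clean")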
Text Tokenisation
As a final step in the pre-processing pipeline, we're going to tokenise the text. Tokenisation is going to happen using the already familiar str.split() method, where we specify a whitespace as the split token.
emails_pl = emails_pl.with_columns(
pl.col("email_text").str.split(" ").alias("email_text_tokenised")
)
Again, let's put this code into a function for our final pipeline.
def tokenise_text(data: pl.DataFrame, col: str, split_token: str = " ") -> pl.DataFrame:
    data = data.with_columns(
        pl.col(col).str.split(split_token).alias(f"{col}_tokenised")
    )
    return data
Removing Stop Words
If you've worked with text data before, you know that stop word removal is a key step in pre-processing tokenised texts. Removing these words allows us to focus the analysis only on the important parts of the text.
To remove these words, we first need to define them. Here, I'm going to use the default set of stop words from the nltk library plus a set of HTML-related words.
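If you haven't used nltk's stop word list before, the corpus needs a one-off download and an import first (this assumes nltk itself is already installed):

import nltk
from nltk.corpus import stopwords

# One-off download of the stop words corpus (safe to re-run)
nltk.download("stopwords")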
stops = set(
stopwords.words("english")
+ ["", "nbsp", "content", "type", "text", "charset", "iso", "qzsoft"]
)
Now, we need to find out whether these words exist in the tokenised array and, if they do, drop them. For this we'll need to use the arr.eval method because it allows us to run Polars expressions (e.g. .is_in) against every element of the tokenised list. Make sure to read the comments below to understand what each line does, as this part of the code is more complicated.
emails_pl = emails_pl.with_columns(
    # Apply to the tokenised column (it's a list)
pl.col("email_text_tokenised")
    # For every element, return it only if it is not a stop word and is longer than 2 characters
.arr.eval(
pl.when(
            (~pl.element().is_in(stops)) & (pl.element().str.n_chars() > 2)
).then(pl.element())
)
    # For every element of the new list, drop the nulls (previously the items that were stop words)
.arr.eval(pl.element().drop_nulls())
.keep_name()
)
As usual, let's refactor this bit of code into a function for our final pipeline.
def remove_stopwords(
data: pl.DataFrame, stopwords: set | list, col: str
) -> pl.DataFrame:
data = data.with_columns(
pl.col(col)
.arr.eval(pl.when(~pl.element().is_in(stopwords)).then(pl.element()))
.arr.eval(pl.element().drop_nulls())
)
return data
While this pattern may appear quite complicated, it's well worth using the pre-defined str and arr expressions, as they keep the performance optimal.
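One caveat worth flagging: on more recent Polars releases the list operations have moved from the arr namespace to a list namespace, so if the code above raises attribute errors you may need something along these lines (a sketch, assuming a newer Polars version and the stops set defined earlier):

# Same stop word removal, but with the .list namespace used by newer Polars releases
emails_pl = emails_pl.with_columns(
    pl.col("email_text_tokenised")
    .list.eval(pl.when(~pl.element().is_in(list(stops))).then(pl.element()))
    .list.eval(pl.element().drop_nulls())
    .alias("email_text_tokenised")
)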
Full Pipeline
So far, we've defined the pre-processing functions and seen how they can be applied to a single column. Polars provides a very handy pipe method that allows us to chain Polars operations specified as functions. Here's what the final pipeline looks like:
emails = load_emails_txt("fradulent_emails.txt")
emails_pl = pl.DataFrame({"emails": emails})

emails_pl = (
emails_pl.pipe(extract_fields)
.pipe(email_features, "email_text")
.pipe(email_features, "sender_email")
.pipe(email_features, "subject")
.pipe(email_clean, "email_text")
.pipe(email_clean, "sender_name")
.pipe(email_clean, "subject")
.pipe(tokenise_text, "email_text")
.pipe(tokenise_text, "subject")
.pipe(remove_stopwords, stops, "email_text_tokenised")
.pipe(remove_stopwords, stops, "subject_tokenised")
)
Notice that now we can easily apply all of the feature engineering, cleaning, and tokenisation functions to all of the extracted columns, and not only the email text like in the examples above.
If you've got this far, great job! We've read in, cleaned, processed, tokenised, and done basic feature engineering on ~4k text records in under a second (at least on my Mac M2 machine). Now, let's enjoy the fruits of our labour and do some basic text analysis.
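At this point you may also want to persist the processed DataFrame so that the analysis can be picked up later. One option (the file name here is just an example) is Parquet, which handles the list columns fine:

# Illustrative: save the processed emails for later analysis/modelling
emails_pl.write_parquet("processed_emails.parquet")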
First of all, let's have a look at the word cloud of the email texts and marvel at all the silly things we can find.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Word cloud function
def generate_word_cloud(text: str):
    wordcloud = WordCloud(
        max_words=100, background_color="white", width=1600, height=800
    ).generate(text)

    plt.figure(figsize=(20, 10), facecolor="k")
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
# Prepare data for word cloud
text_list = emails_pl.select(pl.col("email_text_tokenised").arr.join(" "))[
"email_text_tokenised"
].to_list()
all_emails = " ".join(text_list)
generate_word_cloud(all_emails)
Bank accounts, next of kin, security companies, and deceased relatives: it's got it all. Let's see what these look like for text clusters created using simple TF-IDF and K-Means.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF with 500 words
vectorizer = TfidfVectorizer(max_features=500)
transformed_text = vectorizer.fit_transform(text_list)
tf_idf = pd.DataFrame(transformed_text.toarray(), columns=vectorizer.get_feature_names_out())

# Cluster into 5 clusters
n = 5
cluster = KMeans(n_clusters=n, n_init='auto')
clusters = cluster.fit_predict(tf_idf)
for c in range(n):
cluster_texts = np.array(text_list)[clusters==c]
cluster_text = ' '.join(list(cluster_texts))
generate_word_cloud(cluster_text)
Below you can see a few interesting clusters that I've identified:
Besides these, I also found a few nonsense clusters, which means that there is still room for improvement when it comes to text cleaning. Still, it looks like we were able to extract useful clusters, so let's call it a success. Let me know which clusters you find!
This post has covered a wide variety of pre-processing and cleaning operations that the Polars library allows you to do. We've seen how to use Polars to:
- Extract specific patterns from texts
- Split texts into lists based on a token
- Calculate lengths and the number of matches in texts
- Clean texts using regex
- Tokenise texts and filter out stop words
I hope that this post was useful to you and that you'll give Polars a chance in your next NLP project. Please consider subscribing, clapping and commenting below.
Radev, D. (2008), CLAIR collection of fraud email, ACL Data and Code Repository, ADCR2008T001, http://aclweb.org/aclwiki
Project GitHub: https://github.com/aruberts/tutorials/tree/main/metaflow/fraud_email
Polars User Guide: https://pola-rs.github.io/polars-book/user-guide/