## Understand Semantic Structures with Transformers and Topic Modeling


We live in the age of big data. At this point it has become a cliche to say that data is the oil of the 21st century, but it really is so. Data collection practices have resulted in huge piles of data in just about everyone’s hands.

Interpreting data, however, is no easy task, and much of industry and academia still rely on solutions that offer little in the way of explanation. While deep learning is incredibly useful for predictive purposes, it rarely gives practitioners an understanding of the mechanics and structures that underlie the data.

Textual data is especially tricky. While natural language and concepts like “topics” are incredibly easy for humans to grasp intuitively, producing operational definitions of semantic structures is far from trivial.

In this article I’ll introduce you to different conceptualizations of discovering latent semantic structures in natural language, we will look at operational definitions of the theory, and finally I’ll demonstrate the usefulness of the method with a case study.

While “topic” seems like a completely intuitive and self-explanatory term to us humans, it is hardly so when we try to come up with a useful and informative definition. The Oxford dictionary’s definition is luckily here to help us:

A subject that’s discussed, written about, or studied.

Well, this didn’t get us much closer to something we can formulate in computational terms. Notice how the word *subject* is used to hide all the gory details. This need not deter us, however; we can certainly do better.

In Natural Language Processing, we often use a spatial definition of semantics. This might sound fancy, but essentially we imagine that the semantic content of text/language can be expressed in some continuous space (often high-dimensional), where concepts or texts that are related are closer to each other than those that aren’t. If we embrace this theory of semantics, we can easily come up with two possible definitions of a topic.

## Topics as Semantic Clusters

A rather intuitive conceptualization is to imagine topics as groups of passages/concepts in semantic space that are closely related to each other, but not as closely related to other texts. This incidentally implies that one passage *can only belong to one topic at a time.*

This clustering conceptualization also lends itself to thinking about topics *hierarchically*. You can imagine that the topic “living organisms” might contain two subclusters, one being “Eukaryotes” and the other “Prokaryotes”, and then you could go down this hierarchy until, at the leaves of the tree, you find actual instances of concepts.

Of course, a limitation of this approach is that longer passages might contain multiple topics. This could be addressed by splitting texts into smaller, atomic parts (e.g. words) and modeling over those, but we can also ditch the clustering conceptualization altogether.

## Topics as Axes of Semantics

We can also think of topics as the underlying dimensions of the semantic space in a corpus. Or in other words: instead of describing what groups of documents there are, we explain variation in documents by finding underlying **semantic signals**.


You could, for instance, imagine that the most important axes underlying restaurant reviews would be:

- Satisfaction with the food
- Satisfaction with the service

I hope you see why this conceptualization is useful for certain purposes. Instead of finding “good reviews” and “bad reviews”, we get an understanding of what drives the differences between them. A pop-culture example of this kind of theorizing is, of course, the political compass. Once again, instead of being interested in finding “conservatives” and “progressives”, we find the **factors** that differentiate them.
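To make the contrast between the two conceptualizations concrete, here is a minimal sketch on toy data. The coordinates, centroids and axis directions are made up for illustration and are not real embeddings. The clustering view forces every review into exactly one group, while the axes view gives every review a score on each dimension.

```python
import numpy as np

# Toy 2-D "semantic space" with made-up coordinates (not real embeddings).
reviews = np.array([
    [0.9, 0.8],    # great food, great service
    [0.8, -0.7],   # great food, poor service
    [-0.9, 0.6],   # poor food, great service
])

# Clustering view: each review receives exactly one label (nearest centroid).
centroids = np.array([[1.0, 1.0], [-1.0, -1.0]])  # hypothetical "good" and "bad" clusters
distances = np.linalg.norm(reviews[:, None, :] - centroids[None, :, :], axis=2)
cluster_labels = distances.argmin(axis=1)

# Axes view: each review receives a score on every axis.
axes = np.array([[1.0, 0.0], [0.0, 1.0]])  # hypothetical "food" and "service" directions
axis_scores = reviews @ axes.T

print(cluster_labels)
print(axis_scores)
```

The two mixed reviews end up in different clusters even though neither is plainly “good” or “bad”, whereas their axis scores show exactly where they differ.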

Now that we have the philosophy out of the way, we can get our hands dirty designing computational models based on our conceptual understanding.

## Semantic Representations

Classically, the way we represented the semantic content of texts was the so-called **bag-of-words** model. Essentially, you make the very strong, and almost trivially wrong, assumption that the unordered collection of words in a document is constitutive of its semantic content. While these representations are plagued with a number of issues (curse of dimensionality, discrete space, etc.), they have been demonstrated to be useful by decades of research.
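A quick way to see why the bag-of-words assumption is “almost trivially wrong” is to compare two sentences with opposite meanings. The helper `bag_of_words` below is a hypothetical, deliberately naive tokenizer written for this illustration only.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase, whitespace-separated tokens, ignoring word order."""
    return Counter(text.lower().split())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# Both sentences contain the same words the same number of times,
# so their bag-of-words representations are indistinguishable.
print(a == b)
```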

Luckily for us, the state of the art has progressed beyond these representations, and we now have access to models that can represent text in context. Sentence Transformers are transformer models that encode passages into a high-dimensional continuous space, where semantic similarity is indicated by vectors having high cosine similarity. In this article I’ll mainly focus on models that use these representations.
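Cosine similarity itself is simple to compute. The sketch below uses three made-up vectors standing in for embeddings; a real Sentence Transformer would produce vectors with hundreds of dimensions, but the comparison works the same way.

```python
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Made-up stand-ins for embeddings of three passages.
cat = np.array([0.9, 0.1, 0.3])
kitten = np.array([0.8, 0.2, 0.4])
invoice = np.array([-0.2, 0.9, -0.1])

related = cosine_similarity(cat, kitten)
unrelated = cosine_similarity(cat, invoice)
print(related > unrelated)
```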

## Clustering Models

The models currently most widespread in the topic modeling community for contextually sensitive topic modeling (Top2Vec, BERTopic) are based on the clustering conceptualization of topics.

They discover topics in a process that consists of the following steps:

- Reduce the dimensionality of semantic representations using UMAP
- Discover the cluster hierarchy using HDBSCAN
- Estimate the importance of terms for each cluster using post-hoc descriptive methods (c-TF-IDF, proximity to cluster centroid)
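The last step can be sketched with a simplified class-based TF-IDF weighting in the spirit of c-TF-IDF. The term counts are toy numbers, and the exact formula a given library uses may differ from this one; treat it as an illustration of the idea that terms frequent in one cluster but rare overall should rank highest.

```python
import numpy as np

terms = ["bayesian", "posterior", "image", "pixel", "model"]

# Toy counts: rows are clusters, columns are term counts summed over
# all documents assigned to that cluster.
counts = np.array([
    [12, 9, 0, 0, 14],   # cluster 0
    [0, 0, 15, 8, 16],   # cluster 1
], dtype=float)

tf = counts / counts.sum(axis=1, keepdims=True)   # term frequency within each cluster
avg_words = counts.sum() / counts.shape[0]         # average number of words per cluster
term_freq = counts.sum(axis=0)                     # frequency of each term over all clusters
ctfidf = tf * np.log(1 + avg_words / term_freq)   # down-weight terms common everywhere

top_terms = [terms[i] for i in ctfidf.argmax(axis=1)]
print(top_terms)
```

Note that “model” has the highest raw count in both clusters, yet it is not the top term in either, because it does not distinguish one cluster from the other.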

These models have gained a lot of traction, mainly due to their interpretable topic descriptions, their ability to recover hierarchies, and their ability to learn the number of topics from the data.

If we want to model nuances in topical content and understand the factors of semantics, clustering models are not enough.

I don’t intend to go into great detail about the practical advantages and limitations of these approaches, but most of them stem from the philosophical considerations outlined above.

## Semantic Signal Separation

If we are to discover the axes of semantics in a corpus, we will need a new statistical model.

We can take inspiration from classical topic models, such as **Latent Semantic Analysis** (LSA). LSA utilizes matrix decomposition to find latent components in *bag-of-words* representations. LSA’s main goal is to find words that are highly correlated, and explain their co-occurrence as an underlying semantic component.

Since we are no longer dealing with bag-of-words, explaining away correlation might not be an optimal strategy for us. Orthogonality is not statistical independence. Or in other words: just because two components are uncorrelated, it doesn’t mean that they are statistically independent.
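A small numerical example illustrates the gap between uncorrelated and independent. Below, `y` is completely determined by `x`, yet their linear correlation is essentially zero because the relationship is symmetric around zero.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 1001)   # values symmetric around zero
y = x ** 2                          # fully determined by x, so clearly dependent

correlation = np.corrcoef(x, y)[0, 1]
print(abs(correlation) < 1e-9)      # uncorrelated up to floating-point error

# Dependence: knowing that |x| is large tells us that y is large too.
y_when_x_is_extreme = y[np.abs(x) > 0.9]
print(y_when_x_is_extreme.min() > 0.8)
```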


Other disciplines have luckily come up with decomposition models that discover maximally independent components. **Independent Component Analysis** (ICA) has been used extensively in neuroscience to discover and remove noise signals from EEG data.

The main idea behind Semantic Signal Separation is that we can find maximally independent underlying semantic signals in a corpus of text by decomposing representations with ICA.

We can gain human-readable descriptions of topics by taking the terms from the corpus that rank highest on a given component.
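The idea can be sketched with scikit-learn’s `FastICA`. This is a minimal illustration under stated assumptions, not the library’s implementation: the embeddings are synthetic stand-ins for Sentence Transformer outputs, the term list is invented, and scoring terms by projecting their embeddings through the fitted ICA is one plausible way to rank terms on a component.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Synthetic stand-ins for embeddings (illustration only): three independent,
# non-Gaussian "semantic signals" mixed into a 16-dimensional space.
mixing = rng.normal(size=(3, 16))
doc_embeddings = rng.laplace(size=(500, 3)) @ mixing + 0.01 * rng.normal(size=(500, 16))

# Hypothetical vocabulary with synthetic term embeddings in the same space.
terms = np.array(["bayesian", "noise", "sensor", "network", "image", "policy"])
term_embeddings = rng.laplace(size=(len(terms), 3)) @ mixing

# Decompose document embeddings into maximally independent components.
ica = FastICA(n_components=3, random_state=0, max_iter=1000)
doc_topic = ica.fit_transform(doc_embeddings)     # shape: (n_documents, n_topics)

# Assumption: score terms by projecting their embeddings onto the same components.
term_scores = ica.transform(term_embeddings).T    # shape: (n_topics, n_terms)

# Describe each component by its highest-ranking terms.
for topic_id, scores in enumerate(term_scores):
    top_terms = terms[np.argsort(-scores)[:3]]
    print(topic_id, ", ".join(top_terms))
```

With random term embeddings the printed descriptions carry no meaning; the point is only the shape of the computation, from embeddings to components to ranked terms.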

To demonstrate the usefulness of Semantic Signal Separation for understanding semantic variation in corpora, we will fit a model on a dataset of approximately 118k machine learning abstracts.

To reiterate once again what we’re trying to achieve here: we want to establish the dimensions along which all machine learning papers are distributed. Or in other words, we would like to build a spatial theory of semantics for this corpus.

For this we are going to use a Python library I developed called Turftopic, which has implementations of most topic models that utilize representations from transformers, including Semantic Signal Separation. Additionally, we are going to install the HuggingFace datasets library so that we can download the corpus at hand.

`pip install turftopic datasets`

Let us download the data from HuggingFace:

```python
from datasets import load_dataset

ds = load_dataset("CShorten/ML-ArXiv-Papers", split="train")
```

We are then going to run Semantic Signal Separation on this data. We are going to use the all-MiniLM-L12-v2 Sentence Transformer, as it is quite fast but provides reasonably high-quality embeddings.

```python
from turftopic import SemanticSignalSeparation

# Fit a model with 10 components (axes) on the abstracts
model = SemanticSignalSeparation(10, encoder="all-MiniLM-L12-v2")
model.fit(ds["abstract"])

model.print_topics()
```

These are the highest-ranking keywords for the ten axes we found in the corpus. You can see that most of these are quite readily interpretable, and already help you see what underlies differences in machine learning papers.

I will focus on three axes, sort of arbitrarily, because I found them interesting. I’m a Bayesian evangelist, so Topic 7 seems like an interesting one, as this component appears to describe how probabilistic, model-based and causal papers are. Topic 6 seems to be about noise detection and removal, and Topic 1 is mostly concerned with measurement devices.

We are going to produce a plot displaying a subset of the vocabulary, where we can see how high terms rank on each of these components.

First let’s extract the vocabulary from the model, and select a number of words to display on our graphs. I chose to go with words that are in the 99th percentile based on frequency (so that they still remain somewhat visible on a scatter plot).

```python
import numpy as np

vocab = model.get_vocab()

# We will produce a BoW matrix to extract term frequencies
document_term_matrix = model.vectorizer.transform(ds["abstract"])
frequencies = document_term_matrix.sum(axis=0)
frequencies = np.squeeze(np.asarray(frequencies))

# We select the terms above the 99th percentile of frequency
selected_terms_mask = frequencies > np.quantile(frequencies, 0.99)
```
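To see what the quantile-based mask does, here is the same pattern on a tiny, made-up vocabulary. The 60th percentile is used instead of the 99th only so that the effect is visible with five terms.

```python
import numpy as np

# Made-up terms and frequencies for illustration.
toy_terms = np.array(["the", "model", "data", "bayesian", "umap"])
toy_frequencies = np.array([1000, 400, 350, 20, 3])

# Keep only terms whose frequency is strictly above the chosen percentile.
threshold = np.quantile(toy_frequencies, 0.6)
mask = toy_frequencies > threshold

print(toy_terms[mask])
```

The boolean mask can index any array aligned with the vocabulary, which is why the same mask can later be applied both to the terms and to each row of component strengths.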

We will make a *DataFrame* with the three selected dimensions and the terms so we can easily plot later.

```python
import pandas as pd

# model.components_ is an n_topics x n_terms matrix.
# It contains the strength of all components for each word.
# Here we select the components for the words chosen earlier
# (using the boolean mask selected_terms_mask).
terms_with_axes = pd.DataFrame({
    "inference": model.components_[7][selected_terms_mask],
    "measurement_devices": model.components_[1][selected_terms_mask],
    "noise": model.components_[6][selected_terms_mask],
    "term": vocab[selected_terms_mask],
})
```

We will use the Plotly graphing library to create an interactive scatter plot for interpretation. The X axis is going to be the inference/Bayesian topic, the Y axis is going to be the noise topic, and the color of the dots is going to be determined by the measurement device topic.

```python
import plotly.express as px

px.scatter(
    terms_with_axes,
    text="term",
    x="inference",
    y="noise",
    color="measurement_devices",
    template="plotly_white",
    color_continuous_scale="Bluered",
).update_layout(
    width=1200,
    height=800
).update_traces(
    textposition="top center",
    marker=dict(size=12, line=dict(width=2, color="white"))
)
```

We can already infer a lot about the semantic structure of our corpus based on this visualization. For instance, we can see that papers concerned with efficiency, online fitting and algorithms score very low on statistical inference, which is somewhat intuitive. On the other hand, what Semantic Signal Separation has already helped us do in a data-based approach is confirm that deep learning papers are not very concerned with statistical inference and Bayesian modeling. We can see this from the words “network” and “networks” (along with “convolutional”) ranking very low on our Bayesian axis. This is one of the criticisms the field has received. We’ve just given support to this claim with empirical evidence.


We can also see that clustering and classification are very concerned with noise, but that agent-based models and reinforcement learning are not.

Additionally, an interesting pattern we may observe is the relation of our noise axis to measurement devices. The words “image”, “images”, “detection” and “robust” stand out as scoring very high on our measurement axis. These are also in a region of the graph where noise detection/removal is relatively high, while talk about statistical inference is low. What this suggests to us is that measurement devices capture a lot of noise, and that the literature is trying to counteract these issues, mainly not by incorporating noise into statistical models, but by preprocessing. This makes a lot of sense, as, for instance, neuroscience is known for having very extensive preprocessing pipelines, and many of its models have a hard time dealing with noise.

We can also observe that the lowest scoring terms on measurement devices are “text” and “language”. It seems that NLP and machine learning research is not very concerned with the neurological bases of language and psycholinguistics. Observe that “latent” and “representation” are also relatively low on measurement devices, suggesting that machine learning research in neuroscience is not super involved with representation learning.

Of course, the possibilities from here are endless; we could spend a lot more time interpreting the results of our model, but my intent was to demonstrate that we can already find claims and establish a theory of semantics in a corpus by using Semantic Signal Separation.


One thing I would like to emphasize is that Semantic Signal Separation should mainly be used as an exploratory measure for establishing theories, rather than taking its results as proof of a hypothesis. What I mean here is that our results are sufficient for gaining an intuitive understanding of the differentiating factors in our corpus, and then building a theory about what is happening and why it is happening, but they are not sufficient for establishing the theory’s correctness.

Exploratory data analysis can be confusing, and there are of course no one-size-fits-all solutions for understanding your data. Together we’ve looked at how to enhance our understanding with a model-based approach, from theory, through computational formulation, to practice.

I hope this article will serve you well when analysing discourse in large textual corpora. If you intend to learn more about topic models and exploratory text analysis, make sure to have a look at some of my other articles as well, as they discuss some aspects of these subjects in greater detail.

*((Unless stated otherwise, figures were produced by the author.))*