Extracting keywords with similar meaning from a Text Corpus
Ever thought of extracting or searching for similar words in a big text corpus, only to remember how much hard work it takes, and then sat down to code anyway because that seemed like the only option?
Woooh!!! I got tired just writing that.
Let's simplify things and look at the problems with old-school Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA). Both use the bag-of-words (BOW) representation and ignore semantics. We will also learn about a good-to-go way of extracting keywords.
Example: in a BOW representation, India and Indian are treated as two unrelated words even though they are semantically very close. (If you find this example interesting, give a smile, though I know it's not.)
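To make this concrete, here is a minimal bag-of-words sketch using only the Python standard library. The two sentences and the `bag_of_words` helper are made up for illustration, and the sketch deliberately skips stemming or lemmatization, which real pipelines often add.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


doc_a = bag_of_words("India won the match")
doc_b = bag_of_words("The Indian team won the match")

# "india" and "indian" end up as two separate keys: the representation
# has no notion that the two words are related in meaning.
shared = set(doc_a) & set(doc_b)
print(sorted(shared))
print("india" in doc_b, "indian" in doc_a)
```

The only overlap between the two documents comes from exact token matches, so the country and its adjective contribute nothing to their similarity.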
I have been working in NLP for the past 3 years, and over that time I have come to appreciate how important semantics are.
Many methods show promising results for extracting keywords with similar meaning, but Top2Vec gave me particularly good ones. So let's look at the Top2Vec way of doing topic modeling and semantic search.
What is Top2Vec?
Top2Vec is an algorithm that leverages joint document and word semantic embedding to find topic vectors.
A topic vector is a distributed representation of a topic, calculated from a dense area of document vectors. The number of dense areas found in the semantic space is assumed to be the number of prominent topics, and each topic vector is the centroid of one dense area of document vectors.
After training, Top2Vec can do the following:
- Get the number of detected topics.
- Get topics.
- Get topic sizes.
- Get hierarchical topics.
- Search topics by keywords.
- Search documents by topic.
- Search documents by keywords.
- Find similar words.
- Find similar documents.
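As a rough sketch of the first three items, the helper below collects the topic number, size, and leading words from an already-trained model. `summarize_topics` is a hypothetical helper written for this post, not part of the library; it relies on the `get_num_topics`, `get_topic_sizes`, and `get_topics` methods of a `Top2Vec` object and assumes `model` has finished training.

```python
def summarize_topics(model, words_per_topic=5):
    """Collect topic number, size, and leading words from a trained model.

    `model` is assumed to be a trained top2vec.Top2Vec instance; this
    helper is illustrative and not part of the library.
    """
    num_topics = model.get_num_topics()
    topic_sizes, size_topic_nums = model.get_topic_sizes()
    topic_words, _word_scores, topic_nums = model.get_topics(num_topics)

    # Map each topic number to the number of documents assigned to it.
    size_by_topic = {int(n): int(s) for n, s in zip(size_topic_nums, topic_sizes)}

    return [
        {
            "topic": int(num),
            "size": size_by_topic[int(num)],
            "words": [str(w) for w in words[:words_per_topic]],
        }
        for words, num in zip(topic_words, topic_nums)
    ]


# Usage, once a model has been trained:
#   for row in summarize_topics(model):
#       print(row)
```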
In Top2Vec, a topic vector in the semantic embedding represents a prominent topic shared among documents.
Let's walk through how Top2Vec builds its model. It takes the following 4 steps.
Creating Semantic Embeddings
This follows the simple concept of placing semantically similar words close together in the embedding space and dissimilar words further apart. Documents are embedded in the same space, so the words that best describe a group of documents should lie nearest to that dense cloud of document vectors.
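Here is a toy sketch of that idea in plain Python. The 3-dimensional vectors are hand-picked for illustration and are not the output of any real model; the point is only to show how cosine similarity ranks words by how close they sit to a document vector.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, made up for this example.
word_vectors = {
    "rocket": [0.9, 0.1, 0.0],
    "orbit": [0.5, 0.5, 0.1],
    "goalkeeper": [0.0, 0.1, 0.9],
}
# A made-up document vector for a text about space flight.
doc_vector = [0.9, 0.2, 0.0]

# Rank words from most to least similar to the document.
ranked = sorted(
    word_vectors,
    key=lambda w: cosine_similarity(word_vectors[w], doc_vector),
    reverse=True,
)
print(ranked)
```

With these vectors the space-related words rank ahead of the football one, which is the behaviour a joint word and document embedding is meant to produce.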
Creating a low dimensional vector space
Top2Vec uses UMAP for dimensionality reduction. But why UMAP and not t-SNE or something else? UMAP seems to give better results on large datasets.
UMAP has several hyper-parameters that determine how it performs dimension reduction. Perhaps the most important parameter is the number of nearest neighbors, which controls the balance between preserving global structure versus local structure in the low dimensional embedding.
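As a rough sketch, the settings below are what I understand to be Top2Vec's default UMAP arguments, so treat the exact values as an assumption and check them against your installed version. The `with_neighbors` helper is hypothetical and only shows how you might vary the neighbourhood size before handing the dictionary to the `umap_args` parameter of the `Top2Vec` constructor.

```python
# Assumed defaults; verify against the Top2Vec version you have installed.
umap_args = {
    "n_neighbors": 15,   # smaller -> more local structure, larger -> more global
    "n_components": 5,   # dimensionality of the reduced space
    "metric": "cosine",  # distance used between document vectors
}


def with_neighbors(args, n_neighbors):
    """Return a copy of the UMAP settings with a different neighbourhood size."""
    updated = dict(args)
    updated["n_neighbors"] = n_neighbors
    return updated


local_view = with_neighbors(umap_args, 5)    # emphasises fine-grained structure
global_view = with_neighbors(umap_args, 50)  # emphasises the broad layout

# Passing the settings on would look like this (not executed here):
#   model = Top2Vec(documents=docs, umap_args=local_view)
```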
Finding the dense cloud
The UMAP-reduced vectors are then used to find dense clouds of similar documents. But there's a problem here: the data can be sparse in places and full of noise, with no prominent indication of topics.
HDBSCAN is used to find the dense areas of document vectors, as it was designed to handle both noise and clusters of variable density, which takes care of the sparse regions. HDBSCAN assigns a label to each dense cluster of document vectors and a noise label to every document vector that is not in a dense cluster, so noisy documents do not distort the topics.
Calculate Centroids in Original Dimensional Space
Now that we have a label for each cluster from step 3, we can calculate the topic vectors. A centroid can be computed as the arithmetic mean of all the document vectors in the same dense cluster, as their geometric mean, or as a mean weighted by each document's cluster-membership confidence.
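A minimal sketch of the arithmetic-mean option, in plain Python, might look like this. The vectors and labels are made up, `topic_vectors` is a hypothetical helper rather than Top2Vec's own code, and documents labelled -1 (the label HDBSCAN uses for noise) are left out of every centroid.

```python
def topic_vectors(doc_vectors, labels, noise_label=-1):
    """Arithmetic-mean centroid of the document vectors in each cluster.

    Documents carrying `noise_label` are skipped.
    """
    sums, counts = {}, {}
    for vec, label in zip(doc_vectors, labels):
        if label == noise_label:
            continue
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


# Made-up 2-D "document vectors" and cluster labels for illustration.
docs = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [9.0, 9.0]]
labels = [0, 0, 1, 1, -1]

centroids = topic_vectors(docs, labels)
print(centroids)
```

The last document is an outlier labelled as noise, so it does not pull either centroid towards it.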
Now let's test it out in code.
pip install top2vec
Training it on the scikit-learn 20 Newsgroups dataset:
from top2vec import Top2Vec
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)
A typical use case after training is searching the corpus for the topics that are closest to a keyword.
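One way to run such a search is sketched below. `top_topics_for_keyword` is a hypothetical wrapper written for this post; it relies on the `search_topics` method of a trained `Top2Vec` model, and the keyword in the usage comment is only an example, so pick a word that actually appears in your corpus.

```python
def top_topics_for_keyword(model, keyword, num_topics=3, words_per_topic=5):
    """Return (topic number, score, leading words) for the closest topics.

    `model` is assumed to be a trained top2vec.Top2Vec instance; this
    wrapper is illustrative and not part of the library.
    """
    topic_words, _word_scores, topic_scores, topic_nums = model.search_topics(
        keywords=[keyword], num_topics=num_topics
    )
    return [
        (int(num), float(score), [str(w) for w in words[:words_per_topic]])
        for words, score, num in zip(topic_words, topic_scores, topic_nums)
    ]


# Usage, once the model above has been trained:
#   for topic_num, score, words in top_topics_for_keyword(model, "medicine"):
#       print(topic_num, round(score, 3), words)
```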
If you like this, do clap and show your love by sharing. You can follow me on Medium for more such content.
Top2Vec GitHub: https://github.com/ddangelov/Top2Vec
Let's connect on LinkedIn.