Finding duplicate sentences using semantic understanding

Abhishek Patnaik
5 min read · Oct 24, 2019

Understanding semantic meaning has always been a tough job in NLP, and determining the similarity between two texts is just as difficult. This blog assumes a good knowledge of word embeddings.

Having worked at about three startups, I have come to appreciate how important semantic understanding is. Here is one of the approaches I use these days.

Let's take two different sentences as an example:

“I want to learn java programming” and “I want to learn advanced java”

The two are different questions. We as humans can tell whether two sentences are duplicates, but how do we make our AI understand this? There was a challenge on Kaggle recently on finding duplicate questions in Quora. Refer to this link for the challenge.

I tried out an unsupervised learning strategy for the same.

I have shared the code for this on GitHub.

Now let’s get started

For this repo, I am using a dataset from the Kaggle competition. Link to the dataset.

Let's see how the data looks.

Data fields

  • id — the id of a training set question pair
  • qid1, qid2 — unique ids of each question (only available in train.csv)
  • question1, question2 — the full text of each question
  • is_duplicate — the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.

“is_duplicate” indicates whether the two questions mean the same thing. A 0 means the pair is not similar, and a 1 means the opposite. This is how the data looks when we visualize the label column.

There are several ways to approach this problem, such as computing the cosine similarity between two sentence embeddings, using Word Mover's Distance (WMD), or training a random forest classifier. But we will be working with InferSent.
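To make the fields above concrete, here is a minimal sketch of inspecting the label column with pandas. The four rows are made up for illustration and only mimic the columns of train.csv; with the real file you would start from `pd.read_csv("train.csv")` instead, assuming that is where you saved it.

```python
import pandas as pd

# Toy rows that only mimic the train.csv columns; the text and labels are made up.
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "qid1": [1, 3, 5, 7],
    "qid2": [2, 4, 6, 8],
    "question1": [
        "I want to learn java programming",
        "How do I learn python?",
        "What is machine learning?",
        "How can I lose weight?",
    ],
    "question2": [
        "I want to learn advanced java",
        "What is the best way to learn python?",
        "How do I cook pasta?",
        "What is the best way to lose weight?",
    ],
    "is_duplicate": [0, 1, 0, 1],
})

# Number of pairs per label (0 = not duplicate, 1 = duplicate).
label_counts = df["is_duplicate"].value_counts().sort_index()

# Since the label is 0/1, its mean is the share of duplicate pairs.
duplicate_share = df["is_duplicate"].mean()

print(label_counts)
print(f"share of duplicates: {duplicate_share:.2f}")
```

Checking the class balance first is useful because it tells you what accuracy a trivial "always 0" baseline would get.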

InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language inference data and generalizes well to many different tasks.

But let me state the reasons why we are choosing InferSent over everything else in the list. Because proofs are important, my friend :)

The WMD (Word Mover's Distance) approach gives encouraging results. The paper “Using Centroids of Word Embeddings and Word Mover’s Distance for Biomedical Document Retrieval in Question Answering” uses centroid distance for initial pruning and then WMD for fine-grained results.

But note that WMD is slow. Just building the word-to-word distance matrix is O(n*m), where n is the length of sentence1 and m is the length of sentence2, and solving the transport problem on top of that costs more still. You are dead if you have a big dataset.

But these two approaches do not use the information coming from sequences. So there has been a lot of research in this direction as well.

InferSent uses the information coming from sequences and takes into consideration the importance of each word in a sentence. Refer to the graph below to understand how things work.

As we can see, Barack-Obama, president and United States are given more priority than the other words. Don't worry, we will get to this by the end of this blog.

Unlike in computer vision, where convolutional neural networks are predominant, there are multiple ways to encode a sentence using neural networks. InferSent uses a bi-directional LSTM architecture with max pooling, trained on the Stanford Natural Language Inference (SNLI) dataset.

This is how InferSent aims to demonstrate that sentence encoders trained on natural language inference are able to learn sentence representations that capture universally useful features.

Wait, but what in the world are this u and v, and these arrows? Cool. Let's break it down to make it simpler.

Our goal is to train a generic encoder. We have a sentence encoder that outputs a representation u for the premise and v for the hypothesis. After the sentence vectors are generated, 3 matching methods are applied to extract the relation between the two: concatenation of the two vectors (u, v), element-wise product u ∗ v, and absolute element-wise difference |u − v|.

Then the result is fed to a 3-class classifier. We will cover the details of the LSTM models in the next blogs because that's not our aim here ;) . The next blog will go into detail on how LSTMs work and how InferSent is trained. To begin, we start off by importing the file from InferSent.
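The max pooling step is also where the idea of word importance comes from. Below is a rough NumPy sketch, not the actual InferSent code: the `hidden_states` matrix is random and merely stands in for the BiLSTM outputs (one row per word), and the word list and `hidden_dim` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

words = ["Barack", "Obama", "is", "the", "president"]
hidden_dim = 8

# Stand-in for the BiLSTM outputs: one row of hidden features per word.
# In the real model these come from the trained encoder, not from random numbers.
hidden_states = rng.normal(size=(len(words), hidden_dim))

# Max pooling over the words: for every feature keep the largest value.
sentence_vector = hidden_states.max(axis=0)

# For every feature, remember which word supplied that maximum.
winners = hidden_states.argmax(axis=0)

# A word's importance is the share of features for which it "won" the max.
importance = np.bincount(winners, minlength=len(words)) / hidden_dim

for word, score in zip(words, importance):
    print(f"{word:10s} {score:.2f}")
```

With a trained encoder, content words tend to win the max for more features than filler words, which is what an importance graph of this kind visualizes.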
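The three matching methods described above can be sketched in a few lines of NumPy. The helper name `combine` and the tiny 3-dimensional vectors are my own choices for illustration; real sentence vectors are much larger.

```python
import numpy as np

def combine(u, v):
    """Build the classifier input from premise vector u and hypothesis vector v."""
    return np.concatenate([
        u,              # premise representation
        v,              # hypothesis representation
        np.abs(u - v),  # absolute element-wise difference
        u * v,          # element-wise product
    ])

# Made-up vectors, only to show the shapes involved.
u = np.array([1.0, -2.0, 0.5])
v = np.array([0.5, 1.0, 0.5])

features = combine(u, v)
print(features.shape)  # four blocks of size 3, so (12,)
print(features)
```

The combined vector is four times the size of a single sentence vector, so the classifier sees both sentences plus two views of how they differ.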

Make sure that you have the pretrained sentence encoders. Open the terminal and execute:

mkdir encoder
curl -Lo encoder/infersent1.pkl
curl -Lo encoder/infersent2.pkl

Set the word vector path for the model.

Here build_vocab_k_words() does the work of loading the embeddings of the k most frequent words.

Define a cosine function and you are good to go.

Where “A” and “B” are the sentence embeddings.
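A cosine function along these lines could look like the sketch below. It uses plain NumPy, and the zero-vector guard is my own addition rather than something from the original notebook.

```python
import numpy as np

def cosine(A, B):
    """Cosine similarity between two sentence embeddings A and B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    denom = np.linalg.norm(A) * np.linalg.norm(B)
    if denom == 0.0:
        # Assumption: treat an all-zero embedding as "not similar" instead of dividing by zero.
        return 0.0
    return float(np.dot(A, B) / denom)

# Made-up 2-d vectors, just to show the range of the score.
print(cosine([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The score ranges from -1 to 1, and it depends only on the direction of the two vectors, not on their length.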

The results were pretty satisfying. I got an accuracy of 71% when I applied it to the Kaggle dataset. I used a threshold value of 0.80 and considered every pair of sentences to be similar if their score was above the threshold. You can see the detailed explanation in this notebook.

Moreover, we could have trained a classifier on top of it using the similarity score along with TF-IDF and word-count values. The code for all the above will be shared in the upcoming blogs. So please do follow. Do comment if you face any issues. What would be your approach for the same?
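The thresholding step can be sketched as follows. The scores and labels here are made-up numbers for illustration, not results from the Kaggle data; only the 0.80 threshold comes from the experiment described above.

```python
import numpy as np

THRESHOLD = 0.80  # the threshold used in this post

# Made-up cosine scores and ground-truth labels for five sentence pairs.
scores = np.array([0.92, 0.85, 0.40, 0.81, 0.79])
labels = np.array([1, 0, 0, 1, 1])

# A pair is predicted as duplicate when its score is above the threshold.
predictions = (scores > THRESHOLD).astype(int)

# Accuracy is the fraction of pairs where prediction and label agree.
accuracy = (predictions == labels).mean()

print(predictions)
print(f"accuracy: {accuracy:.2f}")
```

Since the threshold is the only tunable knob in this unsupervised setup, it is worth trying a few values on a held-out slice of the data.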


Follow me on Medium and LinkedIn for more awesome content.


