How to compute the similarity between two text documents with Python?

Sometimes, we want to compute the similarity between two text documents with Python.

In this article, we’ll look at how to compute the similarity between two text documents with Python.

How to compute the similarity between two text documents with Python?

To compute the similarity between two text documents with Python, we can use the scikit-learn library.

To install it, we run

pip install -U scikit-learn

Then we use it by writing

from sklearn.feature_extraction.text import TfidfVectorizer

# text_files is a list of paths to the documents to compare
documents = []
for path in text_files:
    with open(path) as f:
        documents.append(f.read())
tfidf = TfidfVectorizer().fit_transform(documents)
pairwise_similarity = tfidf * tfidf.T

to open the files with the paths in the text_files list.
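
The snippet assumes that text_files already exists. One way to build it, assuming the documents are .txt files in a single directory, is sketched below. The list_text_files helper and the docs directory name are hypothetical and only serve as an illustration.

```python
from pathlib import Path


def list_text_files(directory):
    """Return the sorted paths of the .txt files directly inside directory."""
    return sorted(str(p) for p in Path(directory).glob("*.txt"))


# "docs" is a placeholder directory name; replace it with your own.
text_files = list_text_files("docs")
```

Sorting the paths keeps the row order of the similarity matrix predictable, so row i always corresponds to text_files[i].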

Then we create a TfidfVectorizer object and call fit_transform with the strings returned by read. This returns a sparse matrix with one row of TF-IDF weights per document.

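And then we get their pairwise similarity with tfidf * tfidf.T. Since TfidfVectorizer normalizes each row to unit length by default, this product is the cosine similarity matrix: the entry at row i and column j is the similarity between documents i and j.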
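
To see what the result looks like, here is a minimal sketch that uses three made-up in-memory strings instead of files. The sample sentences and the score_0_1 and score_0_2 names are assumptions for illustration. The sketch uses the @ operator for the matrix product and converts the sparse result to a dense array so that single entries are easy to read.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents used in place of files on disk.
documents = [
    "The cat sat on the mat.",
    "A cat sat on a mat.",
    "Stock prices fell sharply on Monday.",
]

tfidf = TfidfVectorizer().fit_transform(documents)

# Rows are L2-normalized by default, so the matrix product of the
# TF-IDF matrix with its transpose holds the cosine similarities.
pairwise_similarity = (tfidf @ tfidf.T).toarray()

# Entry [i, j] is the similarity between documents i and j.
score_0_1 = pairwise_similarity[0, 1]  # the two cat sentences
score_0_2 = pairwise_similarity[0, 2]  # cat sentence vs. stock sentence
print(score_0_1, score_0_2)
```

The diagonal entries are 1 because each document is identical to itself, and the two cat sentences should score higher with each other than either does with the stock sentence. For just two documents, pairwise_similarity[0, 1] is the single similarity score we want.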

Conclusion

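To compute the similarity between two text documents with Python, we can use scikit-learn's TfidfVectorizer to turn the documents into TF-IDF vectors and then multiply the resulting matrix by its transpose.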