Sometimes, we want to compute the similarity between two text documents with Python.
In this article, we’ll look at how to compute the similarity between two text documents with Python.
How to compute the similarity between two text documents with Python?
To compute the similarity between two text documents with Python, we can use the scikit-learn library.
To install it, we run
pip install -U scikit-learn
Then we use it by writing
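from sklearn.feature_extraction.text import TfidfVectorizer
# text_files is a list of paths to the documents we want to compare
documents = [open(f).read() for f in text_files]
# Each row of tfidf holds the TF-IDF weights of one document
tfidf = TfidfVectorizer().fit_transform(documents)
# Entry (i, j) is the similarity between documents i and j
pairwise_similarity = tfidf * tfidf.T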
to open and read the files whose paths are in the text_files list.
Then we create a TfidfVectorizer object and call fit_transform with the strings returned by read. It returns a sparse matrix with one row of TF-IDF weights per document.
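To see what fit_transform returns, here is a minimal sketch that uses three short in-memory strings in place of files. The sample strings and the names docs and vectorizer are assumptions made for illustration and are not part of the snippet above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents standing in for the file contents
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell sharply today",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# One row per document, one column per distinct term found in the corpus
print(tfidf.shape)

# The terms the vectorizer learned, sorted alphabetically
print(sorted(vectorizer.vocabulary_))
```

The number of columns depends on the vocabulary of the whole corpus, so adding or removing documents changes the width of the matrix.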
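And then we get their pairwise similarity with tfidf * tfidf.T. By default, TfidfVectorizer scales each row to unit length, so this matrix product gives the cosine similarity between every pair of documents. The entry at row i and column j is the similarity between documents i and j.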
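As a quick check of the pairwise similarity step, the following sketch builds the similarity matrix for three short in-memory strings and reads off individual scores. The sample strings are made up for illustration. The sketch uses the @ operator, which performs the same matrix product as * on SciPy sparse matrices, and the comparison against cosine_similarity from sklearn.metrics.pairwise is an addition that assumes the default row normalization of TfidfVectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample documents standing in for the file contents
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell sharply today",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Sparse matrix product; entry (i, j) compares document i with document j
pairwise_similarity = tfidf @ tfidf.T

# Convert to a dense NumPy array so individual scores are easy to index
scores = pairwise_similarity.toarray()
print(scores[0, 1])  # documents 0 and 1 share several words
print(scores[0, 2])  # documents 0 and 2 share no words

# With the default L2 normalization, the product matches cosine similarity
print(np.allclose(scores, cosine_similarity(tfidf)))
```

Each document has a similarity of about 1 with itself, so when looking for the closest match to a document, the diagonal entries should be ignored.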
Conclusion
To compute the similarity between two text documents with Python, we can use the TfidfVectorizer class from the scikit-learn library and multiply the resulting TF-IDF matrix by its transpose.