Plagiarism Detector

During the quarantine period I had the opportunity to do many things that I had always found a reason to postpone. I read, I studied, I exercised, I cooked, I even tried to learn a new musical instrument (lol), but in the end the thing I enjoyed the most was this small project I worked on.

As a student experiencing the university closures due to the coronavirus up close, I realised that a problem which has always existed now seems to be getting bigger: plagiarism. Because many courses that used to rely on physical interaction between students and professors have gone digital, more and more homework assignments, tests and quizzes are conducted on computers. From a statistical perspective, it is only logical that incidents of copying will increase.

With this in mind, I developed an application that aims to detect pairs of documents suspicious for plagiarism in very large document datasets. In a second phase, it can show the exact parts of the text that appear to be the same.
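
To give a rough idea of what these two phases can look like, here is a minimal sketch in Python. It is not the code from the repository: the function names, the 3-word shingles, the Jaccard score and the 0.3 threshold are all assumptions I picked for illustration. Phase one scores every pair of documents by how many word shingles they share, and phase two uses the standard library's difflib to pull out the word runs two documents have in common.

```python
import difflib
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suspicious_pairs(docs, threshold=0.3, k=3):
    """Phase 1 (illustrative): compare all pairs and keep those above threshold.

    docs maps a document name to its text. The threshold is an assumed value.
    """
    sets = {name: shingles(text, k) for name, text in docs.items()}
    pairs = []
    for x, y in combinations(sorted(sets), 2):
        score = jaccard(sets[x], sets[y])
        if score >= threshold:
            pairs.append((x, y, score))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


def shared_passages(text_a, text_b, min_words=4):
    """Phase 2 (illustrative): word runs of at least min_words found in both texts."""
    a, b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        " ".join(a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river",
    "b": "yesterday the quick brown fox jumps over the lazy dog again",
    "c": "completely different text about cooking pasta with tomato sauce",
}
for x, y, score in suspicious_pairs(docs):
    print(x, y, round(score, 2), shared_passages(docs[x], docs[y]))
```

Comparing every pair like this grows quadratically with the number of documents, so for really big datasets an approximate technique such as MinHash with locality-sensitive hashing is a common way to narrow down the candidate pairs first.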

You can find the code and the whole application, ready for use, on GitHub at https://lnkd.in/gqHck8G and a Jupyter notebook is also provided as a tutorial.