The purpose of this project is to establish relationships between words in a collection of text documents. Measuring textual relationships is an important part of data mining and big data processing: similarity measures of this kind are used in web search engines, text auto-completion, customer behavior prediction models, and more. The ability to establish similarities between words across different documents can also prove extremely helpful for language translation and processing tools. This project solves a special case of the word similarity problem by determining the relationships between genes and diseases in a given text. The benefits of such a program can be far-reaching: it gives doctors and scientists a tool for “mining” existing medical documents for previously unknown relationships.
For this project, Apache Spark is used as the primary data processing engine. Spark’s map and reduce features are used to run MapReduce algorithms on the input file and determine the words most similar to a given query word. Python is the programming language, and PySpark serves as the API between Python and Spark. MapReduce algorithms are used for the benefits they offer, such as scalability and flexibility; for a word similarity query whose input size can vary widely, these properties are critical. MapReduce also allows the algorithm to run across multiple processing nodes, greatly improving runtime and efficiency.
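As a rough illustration of this pipeline (a minimal sketch, not the project’s actual code), the snippet below counts words that co-occur with a query word and ranks them. The input file name, the line-level co-occurrence window, the example query word, and the raw-count scoring are all assumptions made for the sketch.

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordSimilaritySketch")

query = "brca1"                       # hypothetical query word
lines = sc.textFile("documents.txt")  # hypothetical input file

# Map step: for every line that mentions the query word,
# emit a (co-occurring word, 1) pair for each other word on the line.
def cooccurring_words(line):
    words = line.lower().split()
    if query in words:
        return [(w, 1) for w in words if w != query]
    return []

pairs = lines.flatMap(cooccurring_words)

# Reduce step: sum the co-occurrence counts per word,
# then keep the ten highest-scoring words.
counts = pairs.reduceByKey(lambda a, b: a + b)
top_ten = counts.takeOrdered(10, key=lambda kv: -kv[1])

for word, count in top_ten:
    print(word, count)

sc.stop()
```

A real similarity measure would normalize these counts (for example, by document frequency), but the map and reduce structure stays the same.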
Spark provides an efficient and simple interface to MapReduce-style algorithms through its Transformation and Action operations. In addition, its Resilient Distributed Dataset (RDD) structure provides high levels of efficiency and fault tolerance.
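To make that distinction concrete, the short sketch below (assumed for illustration, not taken from the project) shows lazy Transformations building up an RDD’s lineage and Actions triggering the actual computation.

```python
from pyspark import SparkContext

sc = SparkContext(appName="RDDSketch")

# A tiny in-memory RDD standing in for real document text.
rdd = sc.parallelize(["gene disease", "gene therapy", "disease outbreak"])

# Transformations are lazy: they only record the RDD's lineage.
words = rdd.flatMap(lambda line: line.split())   # transformation
pairs = words.map(lambda w: (w, 1))              # transformation
counts = pairs.reduceByKey(lambda a, b: a + b)   # transformation

counts.cache()  # keep the result in memory; the lineage lets Spark rebuild lost partitions

# Actions trigger the actual computation.
print(counts.collect())  # action: returns all (word, count) pairs to the driver
print(counts.count())    # action: number of distinct words

sc.stop()
```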