Analysing massive amounts of forum posts to find treatments and results on specific types of cancer.
Kristian Nielsen and Jimmi Agerskov
Aarhus University - School of Engineering (ASE)
Over the last few years, interest in data and its value has grown. Large cancer forums exist on the internet, containing millions of posts written by ordinary people. These posts hold many kinds of experience and knowledge about cancer.
When a person is diagnosed with cancer, the same questions recur:
Is there a cure?
Which treatments exist?
What have other people experienced?
The challenge is to retrieve and present the useful information hidden within the millions of forum posts that are freely available online. In this thesis we introduce a cloud-based tool for that purpose. The tool uses both unsupervised and supervised machine learning algorithms and exploits distributed computing to improve execution performance. The machine learning techniques applied are text retrieval, clustering and classification.
In particular, the clustering step is a main focus: the goal is a faster execution time while producing the same result. To that end, the clustering algorithm DBSCAN is implemented in the distributed computing pattern MapReduce, resulting in an implementation of the MR-DBSCAN algorithm.
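As an illustration of the general idea only, the following minimal Python sketch runs DBSCAN locally on overlapping partitions (the map step) and then merges local clusters that share a core point (the reduce step). It is not the implementation described in this thesis: it runs in a single process, uses one-dimensional numeric points in place of document vectors, and all function names (`dbscan`, `map_partition`, `reduce_merge`, `mr_dbscan`) and the partition bounds are assumptions made for the example.

```python
from collections import defaultdict

NOISE = -1

def dbscan(points, eps, min_pts):
    """Plain DBSCAN on 1-D points. Returns (labels, core_flags)."""
    def neighbours(i):
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    labels = [None] * len(points)
    core = [False] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nb = neighbours(i)
        if len(nb) < min_pts:          # not a core point (may become border later)
            labels[i] = NOISE
            continue
        core[i], labels[i] = True, cluster
        stack = [j for j in nb if j != i]
        while stack:
            j = stack.pop()
            if labels[j] == NOISE:     # former noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbours(j)
            if len(nb_j) >= min_pts:   # expand only through core points
                core[j] = True
                stack.extend(nb_j)
        cluster += 1
    return labels, core

def map_partition(pid, lo, hi, indexed_points, eps, min_pts):
    """Map step: local DBSCAN on partition [lo, hi) plus an eps-wide halo."""
    local = [(gid, x) for gid, x in indexed_points if lo - eps <= x < hi + eps]
    labels, core = dbscan([x for _, x in local], eps, min_pts)
    for (gid, x), label, is_core in zip(local, labels, core):
        yield gid, (pid, label, is_core, lo <= x < hi)

def reduce_merge(records, n_points):
    """Reduce step: union local clusters that share a core point."""
    parent = {}

    def find(k):
        parent.setdefault(k, k)
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    by_point = defaultdict(list)
    for gid, rec in records:
        by_point[gid].append(rec)

    for recs in by_point.values():
        keys = [(pid, label) for pid, label, _, _ in recs if label != NOISE]
        # A point that is core in any partition is core globally,
        # so every local cluster containing it belongs together.
        if any(is_core for _, _, is_core, _ in recs):
            for k in keys[1:]:
                parent[find(k)] = find(keys[0])

    final, ids = [NOISE] * n_points, {}
    for gid, recs in by_point.items():
        # Prefer the label from the point's home partition.
        for pid, label, _, _ in sorted(recs, key=lambda r: not r[3]):
            if label != NOISE:
                final[gid] = ids.setdefault(find((pid, label)), len(ids))
                break
    return final

def mr_dbscan(points, bounds, eps, min_pts):
    """Run the map step per partition, then the merging reduce step."""
    indexed = list(enumerate(points))
    records = [rec for pid, (lo, hi) in enumerate(bounds)
               for rec in map_partition(pid, lo, hi, indexed, eps, min_pts)]
    return reduce_merge(records, len(points))
```

In a real MapReduce job each call to `map_partition` would run on a separate worker; the eps-wide halo is what allows clusters that cross a partition boundary to be recognised and merged afterwards. The sketch assumes the partition bounds together cover every point.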
In this thesis, we present the principles, methods and theories of text retrieval, clustering, classification and distributed computing, together with the design and implementation details of the tool.
The evaluation shows that the solution achieves better execution performance than existing non-distributed clustering algorithms. Furthermore, it indicates that forum posts can be clustered by cancer type.
Keywords: Clustering, Text Retrieval, Classification, Distributed Computing, Microservices, Cancer