Clustering Data about Tanzania and NLP for Automating Data Extraction from PDF Files
We are team XploData, a group of junior data scientists with a variety of backgrounds. As this was our second time participating in the Data4Good hackathon, we worked on two projects simultaneously, both related to fighting HIV drug resistance (HIVDR). The goal of the first project was to group the regions of Tanzania into clusters with similar non-HIV-related parameters, including socio-demographics, education, health facilities, drug stockouts, and road infrastructure. We then compared these clusters against several HIVDR-related parameters of interest; of these, viral load suppression was the most strongly related to our clusters. The main difficulties were that the names and number of Tanzanian regions have changed considerably since 2016, which makes data from before 2016 hard to combine with later data, and that data for the most recent years has often not been reported yet. Our final dataset was therefore a combination of recent and less recent data.

Our second project was a Natural Language Processing tool for automatic data extraction. The tool can scan a PDF for figures and tables of interest and translate this information into structured data that can be used for further analysis. It also automatically labels each document, to make classification of documents easier in the future.
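To make the clustering step of the first project more concrete, here is a minimal sketch of how regions could be grouped by non-HIV indicators. It is an illustration under assumptions, not our actual pipeline: this summary does not name the algorithm we used, so plain k-means is assumed here. The region names, indicator columns, and values are invented placeholders, and `standardize`, `kmeans`, and `cluster_regions` are hypothetical helper names.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Scale each column to zero mean and unit variance so that
    no single indicator dominates the distance computation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stdevs = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [mean(col) for col in zip(*members)]
    return labels


def cluster_regions(indicators, k):
    """Map each region name to a cluster label."""
    names = list(indicators)
    scaled = standardize([indicators[name] for name in names])
    return dict(zip(names, kmeans(scaled, k)))


# Placeholder data, NOT real figures. Assumed columns:
# literacy rate, health facilities per 100k people, drug stockout rate.
example = {
    "Region A": [0.82, 12.0, 0.05],
    "Region B": [0.80, 11.0, 0.07],
    "Region C": [0.45, 3.0, 0.30],
    "Region D": [0.50, 4.0, 0.28],
}

clusters = cluster_regions(example, k=2)
for region, label in clusters.items():
    print(region, "-> cluster", label)
```

Standardizing first matters because the indicators live on very different scales (rates between 0 and 1 next to counts per 100k). Once each region has a cluster label, an HIVDR-related parameter such as viral load suppression can be summarized per cluster and compared across clusters.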