Photo by Alexander Sinn on Unsplash

The Role of Data Science Communities in Changing the World

A real-world case on how communities can come together to solve challenging problems.

Luis Zul
Published in Omdena
6 min read · Nov 20, 2020


As someone who has worked in the tech industry for more than five years, I’m happily surprised by peers who want to make a positive impact on the world. They share great qualities:

  • They have great ideas, though often conceptual and broadly targeted
  • They’re willing to learn a new technology
  • They’re willing to collaborate with others

However, these “nice things” also have their caveats:

  • Broad, general ideas without domain expertise are hard to put into practice
  • Most people get stuck “reinventing the wheel” or trying to make something perfect
  • They only collaborate with other people in tech because of their social bubble

Sometimes these caveats weigh down initiatives and keep projects from taking off. However, the key to overcoming these limitations is simple yet complex: creating a community.

Photo by Yohann REVERDY on Unsplash

The Challenge

This month, before joining Google, I had the opportunity to be part of a diverse and growing community that cares about what’s going on in the world: Omdena. I joined their AI Challenge, hosted by the World Resources Institute, to collaborate with experts in their fields on identifying financial incentives for reforestation and landscape restoration. The focus was on government policies in Mexico, Guatemala, El Salvador, Perú, and Chile.

Working within this community provided strong advantages:

  • Concise, grounded requirements distilled by domain experts within Omdena and the World Resources Institute, including government policy analysts
  • Collaborators with expert knowledge in state-of-the-art areas such as Natural Language Processing, Web Scraping, Databases, and Deep Learning
  • Collaborators with a huge focus on getting an MVP out as soon as possible while balancing quality and innovation

Background

To solve the overall challenge, we needed to achieve the following:

  • Extract official, reliable information from the web about the policies in this area
  • Normalize the extracted text for use by the machine learning models we would end up choosing
  • Train machine learning models that give us the desired output
  • Present the results through a user-friendly interface

Web Application showing the relevant paragraphs for financial incentives in reforestation and landscape restoration from the segment-highlighting model

The final solution consisted of the following:

  • A combination of Scrapy and APIFY to extract 100,000+ documents related to financial incentives in reforestation and landscape restoration across the 5 countries
  • An NLP pre-processor that uses NLTK, spaCy, and other Python libraries to normalize the inputs
  • Two semi-supervised models: a segment highlighter and a topic model. The segment highlighter (based on BETO) takes an article as input and outputs the candidate paragraphs likely to contain the financial incentives. The topic model uses LDA (Latent Dirichlet Allocation) to extract the topics within a document, such as reforestation (a minimal sketch of this step follows below).
  • A web application served through a Flask server hosted on an AWS EC2 instance
Topic Model Results Displayed in the Web Application
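
For illustration, below is a minimal sketch of what the topic-model step could look like, assuming gensim’s LDA implementation over already tokenized documents; the library choice, the toy documents, and the parameters are my assumptions for the example, not the challenge’s exact code.

```python
# Hedged sketch of an LDA topic model over pre-tokenized policy documents;
# gensim, the toy documents, and num_topics are illustrative choices only.
from gensim import corpora
from gensim.models import LdaModel

# Each document is assumed to already be a list of normalized tokens.
documents = [
    ["incentivo", "financiero", "reforestación", "subsidio"],
    ["restauración", "paisaje", "bosque", "programa", "pago"],
]

dictionary = corpora.Dictionary(documents)               # token -> id mapping
corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

# Inspect the top words per topic, e.g. to spot a "reforestation" topic.
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)

# Topic distribution for a new document.
new_doc = dictionary.doc2bow(["incentivo", "reforestación"])
print(lda.get_document_topics(new_doc))
```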

My Role

Within the project, I had the pleasure of taking on the following responsibilities:

Preprocessing Task Lead

I had the honor of leading a group of collaborators of all levels of experience. This group worked on normalizing the documents scraped from the web to be used by our semi-supervised machine learning models. I managed these efforts mainly through the use of Trello, GitHub, Slack, and Zoom meetings.

Team Meeting with the Lovely Preprocessing Task Team!

Problem
Documents can contain spelling mistakes, artifacts from source formats such as PDF, trailing whitespace, and so on. This makes the text harder for our machine learning models to understand.

Additionally, we required the documents to be separated into paragraphs so that the segment highlighter picked paragraphs, not sentences, that contained the incentives.

Method

  • Whitespace Removal
  • Paragraph extraction and word-level tokenization
  • Spelling correction via a spell checker

We looked through different NLP pre-processing libraries such as spaCy, NLTK, and scikit-learn, and tested them against the desired results. I was specifically involved in providing recommendations on the tools we used, paragraph extraction, and spell checking.

The whole pipeline needed to be accessible to everyone in the preprocessing team so we could normalize the documents for all 5 countries. I was in charge of creating a Python program that integrated all of these components, and I recorded a video so that beginner and advanced users alike could use the tool and make changes to it if needed.
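
To give an idea of what that program does, here is a minimal sketch of such a normalization step, assuming NLTK for tokenization and pyspellchecker for spelling correction; the actual pipeline combined spaCy, NLTK, and other libraries, so treat the function below as illustrative only.

```python
# Hedged sketch of the normalization step described above: whitespace cleanup,
# paragraph extraction, word-level tokenization, and spelling correction.
import re

import nltk
from nltk.tokenize import word_tokenize
from spellchecker import SpellChecker

nltk.download("punkt", quiet=True)      # tokenizer data for word_tokenize
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK releases
spell = SpellChecker(language="es")     # the policy documents are in Spanish


def normalize_document(raw_text: str) -> list[list[str]]:
    """Return a list of paragraphs, each a list of spell-corrected tokens."""
    # Collapse runs of spaces/tabs and strip trailing whitespace on each line.
    cleaned = "\n".join(re.sub(r"[ \t]+", " ", line).strip()
                        for line in raw_text.splitlines())
    # Treat blank lines as paragraph boundaries.
    paragraphs = [p for p in re.split(r"\n\s*\n", cleaned) if p.strip()]

    normalized = []
    for paragraph in paragraphs:
        tokens = word_tokenize(paragraph, language="spanish")
        # Only try to correct words the Spanish dictionary does not recognize.
        unknown = spell.unknown([t for t in tokens if t.isalpha()])
        corrected = [(spell.correction(t) or t) if t.lower() in unknown else t
                     for t in tokens]
        normalized.append(corrected)
    return normalized
```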

Tutorial Video Screenshot

Output
Consistent, normalized, clean text files that were used by the modeling task.

Application Software Architecture

I also led the software architecture of the overall solution in tandem with the rest of the task leaders.

Problem
My background is in software engineering, and my two priorities were to make everything as modular as possible and to keep everything mappable to what the client needed.

Method

  • For the preprocessing and modeling tasks, I made the initial skeleton of the scripts in collaboration with my teammates. The scripts had to be modular, easy to extend, and readable.
  • I sketched out diagrams that provided details on how the components connected to deliver the data through the application to the end-user
Final Architecture Diagram
  • I participated in meetings where I listened to the clients’ requirements and translated them into use cases, stakeholder requirements, and a system design, following a formal standard so that they mapped directly to software modules within a high-level architecture.
  • I created a proof-of-concept MVP that served dummy data through a Python Flask server so that the mapping between use cases and modules would be clear to the data visualization, modeling, and preprocessing tasks (a sketch follows this list).
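
As an illustration of that proof of concept, here is a minimal sketch of a Flask server returning dummy data; the endpoint name and payload shape are placeholders I chose for the example, not the project’s actual API.

```python
# Minimal sketch of a proof-of-concept Flask server returning dummy results;
# the route and payload format below are illustrative assumptions.
from flask import Flask, jsonify

app = Flask(__name__)

# Placeholder output mimicking the segment highlighter's response shape.
DUMMY_RESULTS = {
    "country": "Mexico",
    "document": "example_policy.pdf",
    "highlighted_paragraphs": [
        {"paragraph": "Example paragraph describing a reforestation subsidy.",
         "score": 0.92},
    ],
}


@app.route("/api/highlights")
def get_highlights():
    # In the real system this would call the preprocessing + model pipeline.
    return jsonify(DUMMY_RESULTS)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```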

Output

  • Modular code for the preprocessing and the segment highlighter model scripts
  • Proof of concept of how results should be served
  • Requirements and use cases that the overall solution needed to accomplish

Additional Support

During the segment highlighter’s early stages, I helped benchmark the models we considered for this purpose. During the final steps of the topic model, I proposed adding negative examples to boost discrimination of the topics we cared about.
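
For context, here is a minimal sketch of how a BETO-based paragraph scorer can be set up with the Hugging Face transformers library; the model id is the public BETO checkpoint, but the binary classification head shown here is untrained, so the semi-supervised model the team actually built would require fine-tuned weights.

```python
# Hedged sketch of scoring paragraphs with a BETO-based classifier. The head
# created by num_labels=2 is randomly initialized here; in practice you would
# load a checkpoint fine-tuned on labeled incentive paragraphs.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "dccuchile/bert-base-spanish-wwm-cased"  # BETO (Spanish BERT)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)
model.eval()


def incentive_score(paragraph: str) -> float:
    """Probability that a paragraph describes a financial incentive."""
    inputs = tokenizer(paragraph, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()


# Rank candidate paragraphs and keep the most likely ones.
paragraphs = ["El programa otorga un subsidio por hectárea reforestada.",
              "La oficina atiende de lunes a viernes."]
ranked = sorted(paragraphs, key=incentive_score, reverse=True)
print(ranked[0])
```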

Through this diverse group of experts, we were able to finish and present a tool to actual policymakers in the five countries that shows them exactly what they are looking for, financial incentives across countries and topics, in a user-friendly interface.

The solution itself, made possible by our 50+ fellow collaborators

I’m proud I was able to present an excellent solution to policymakers that will help them make actionable decisions and improve forestry and natural landscapes in these five countries. Hopefully, this work can be expanded to more countries around the world.

Final Presentation to Policy Makers via Zoom

It wouldn’t have been possible without my fellow collaborators and the Omdena community to which we now belong.

Thank you for this opportunity!
