Final project

Students will complete a semester-long group project that includes some amount of data analysis and some amount of social and ethical analysis. While the format and topic of the final project are flexible — so long as it's focused on big data and development — we expect students to produce high-quality work that could be submitted for publication to a conference, journal, or workshop.

Project topic and expectations

Your final project should either (a) articulate and answer a novel research question in development, or (b) extend an existing research paper with novel analysis. The final project should be completed in a group of 2-4 students. The expectation for the scope and rigor of the project increases with group size. Students are required to commit to a group and a project topic by February 19. There will be in-class forums for students to discuss tentative project ideas and form groups.

The final output of the project is a publication-quality research paper of 10-20 pages. The paper must involve some amount of quantitative data analysis and some amount of social/ethical analysis. The relative weight of the two is up to each group; any ratio is acceptable, as long as some of each is included. For example, acceptable formats for the project include:

  • A paper (10-20 pages) primarily focused on big data analysis, with a separate 2-4 page section on the social and ethical implications of the approach taken.
  • A paper (10-20 pages) primarily focused on social/ethical analysis of a use of big data in development, with a separate 2-4 page section involving data analysis.
  • A paper (10-20 pages) incorporating both data analysis and social/ethical analysis in equal measure.

The final project submission should be formatted in the PNAS submission template. In addition to the final paper, students are required to submit a one-page reflection on the experience of conducting data analysis paired with social/ethical analysis. What tensions and synergies did this approach raise? How will this approach inform (or not inform) your methods going forward? There is no specific format required for this one-page reflection.

Milestones

There are several deadlines prior to the final paper submission. These are designed to help you identify project ideas, form teams, and get feedback from the teaching staff.

February 5 at 11:59pm: Paragraph submission on dataset and application (3 points). Submit one paragraph describing the question you intend to explore in your final project. Your paragraph should include (1) information about the datasets you will analyze; (2) initial ideas for the social/ethical analysis you plan to conduct; (3) names of your team members (if you have already identified any); and (4) how certain you are that you will do this project. Also post your paragraph under the bCourses discussion thread for final project milestone #1 so other students can take a look before the in-class final project mixer. This is a soft commitment; your project proposal can change up until February 19. Everyone must submit this assignment individually, even if they have already identified final project teammates.

February 19 at 11:59pm: Deadline to commit to a project idea, and submit a one-page proposal (7 points). Submit a one-page proposal identifying your team, data source, application area, plan for data analysis, and plan for social/ethical analysis. Also include your stakeholder map (which will be used for in-class labs in the next couple of weeks). Complete the stakeholder map as a group — while the stakeholder map is a useful artifact in itself, the discussions you have while creating it are often the most fruitful part. Stakeholder maps should include the following:

  1. Identify stakeholders – this should include direct stakeholders (people who directly interact with your technology / system) as well as indirect stakeholders (people who may not directly interact with your technology / system but may be impacted nonetheless). For example: if designing medical records, a doctor may be a direct stakeholder because they directly use medical records (e.g., entering information, retrieving information, etc.). A patient may be an indirect stakeholder because although they may never directly interact with the medical record system, they are impacted nonetheless (e.g., perhaps it changes the ways doctors interact with patients). The broader the brainstorm, the better!
  2. Surface values – for each stakeholder, envision what might be important to them. A single stakeholder may care about multiple things at once, and at times the things they care about can be in tension with one another. The more values you surface, the better!
  3. Identify value tensions – Recall that value tensions can exist within a single individual, between individuals and groups, and between individuals, groups, organizations, governments, institutions, and societies.
  4. Reflect: Write a short (1-3 paragraph) reflection on your experience with the stakeholder mapping activity: Did you encounter any challenges or questions while mapping stakeholders? What new questions do you have that you'd want to investigate? What did the process surface that you had not previously considered? What did you like and not like about the process? What did the process do well, and what did it not do well? If you used the envisioning cards (optional), what, if anything, new did that process surface?
  5. Tips: We'd suggest creating your stakeholder map using physical post-it notes (like we did in class), with each stakeholder and value on its own sticky note; or you can do a similar activity digitally using Miro, which allows you to create digital post-it notes. If you are inspired and want to get creative, you may consider tools / approaches from concept mapping. If you're interested in using the envisioning cards that we used in class to help generate additional stakeholders and values, please reach out to Zoe.

March 18 at 11:59pm: Midterm submission (15 points). Submit a 4-6 page report of your work so far, including (1) an annotated bibliography that summarizes the 5-10 most relevant related papers, (2) at least one technical analysis, (3) at least one social/ethical analysis, and (4) a list of questions that you’d like feedback on from the teaching team.

April 23: Final presentation (15 points). Each group will give a 10-minute presentation on their project, with 4 minutes for Q&A. Your presentation should cover motivation and related work, your research question, data and methods (briefly), results (on both data analysis and social/ethical analysis, though you do not need to cover both in equal depth), and discussion of the broader implications and limitations of your work.

May 5 at 11:59pm: Final paper (35 points). The final paper should include both data analysis and social/ethical analysis, and be of sufficient quality to be submitted to a conference, journal, or workshop. Alongside the final paper, students will submit a 1-page reflection on the process of doing technical work alongside social/ethical considerations. This reflection should be written by each student individually. Note: Please include at the top of your submission how you would like us to allocate points in your grade towards your methods and results for data analysis and your methods and results for social/ethical analysis. A total of 15 points are allocated towards these two categories, and a minimum of 3 need to be assigned to each. So, for example, you could assign 3 points to data analysis and 12 points to social/ethical analysis, 12 points to data analysis and 3 points to social/ethical analysis, or anywhere in between.

Project ideas

Students are encouraged to come up with their own project ideas. A few possible project ideas curated by the teaching team are also listed below.

Projects focused more on data analysis

  • Expanding on satellite-based poverty prediction: Replicate and expand the poverty prediction results from Jean et al. (2016), using the dataset prepared for a problem set in previous iterations of Info288. Possible ideas for expansion: generate confidence intervals for satellite-based predictions, incorporate other geospatial data sources besides satellite imagery, or implement approaches to spatial cross-validation to evaluate model accuracy (a spatial cross-validation sketch follows this list). Discuss the privacy and contestability implications of measuring poverty with satellite imagery.
  • Impact evaluation with satellite imagery: Evaluate the impact of an anti-poverty or development intervention using satellite imagery. This will require engaging with research techniques from the literatures on impact evaluation (in either the experimental or quasi-experimental setting, depending on the intervention you study) and inferring poverty or other outcomes from satellite imagery. Consider reading Huang et al. (2021) and Ratledge et al. (2022) for inspiration. It will probably be easiest to access satellite imagery via the Google Static Maps API or Google Earth Engine (an Earth Engine sketch follows this list). Discuss the accuracy, privacy, and contestability implications of measuring poverty with satellite imagery.
  • Benchmarking the accuracy of remotely sensed poverty and vulnerability maps: Choose at least three different satellite-based data products (such as the relative wealth index, gridded human deprivation index, high resolution human development index, MOSAIKS, or data layers from Atlas AI), and benchmark their accuracy using survey datasets from several low- and middle-income countries (a benchmarking sketch follows this list). Make sure to use datasets that the data products were not trained on. Explore which measures of poverty and vulnerability they can predict well, and which they predict poorly. Consider reading Blumenstock and Smythe (2022) and Sartirano et al. (2023) for data analysis inspiration. Discuss the accuracy, privacy, and contestability implications of measuring poverty with satellite imagery.
  • Prototype a rapid damage assessment tool with the Indonesian Red Cross: The IFRC is interested in doing rapid damage assessment post-disaster, using street-level imagery collected by 360-degree cameras. They are planning to conduct a pilot test of this technology in February and are interested in better understanding what can and cannot be inferred from the imagery. Contact Josh if interested.
  • Epidemiological prediction with nontraditional data: Build epidemiological prediction models for an outbreak. Identify a ground-truth epidemiological dataset and pair it with Google search trends, Facebook mobility data, or other nontraditional data sources. Assess the relationship between digital signals and the spread of disease, and build machine learning models to forecast or nowcast disease prevalence (a nowcasting sketch follows this list). Consider reading Aiken et al. (2020) or Ilin et al. (2021) for data analysis inspiration. Discuss possible issues with dataset bias and strategies for communicating results with public health officials.
  • Mobility and natural disasters: Work with a partially-fabricated dataset of call detail records in Rwanda (prepared for a problem set in a previous iteration of Info288) to evaluate the impacts of an earthquake on displacement and mobile phone use (a displacement sketch follows this list). The questions about displacement and phone use posed in the problem set are a starting point, but your analysis should be more rigorous: for example, you could assess the earthquake's impacts on the network structures of mobile phone subscribers, or measure the pre-quake determinants of who is most likely to migrate post-quake, and for how long. Discuss privacy protection strategies for the use of mobile phone records in humanitarian settings like these, and assess the trade-offs between the accuracy of humanitarian interventions and privacy protection. Contact course staff for access to the partially fabricated dataset.
  • Algorithmic fairness in sustainability applications: Work with any of the machine learning benchmark datasets listed in the "data sources" section, such as SustainBench, satellite imagery in WILDS, or iWildCam. Assess the algorithmic fairness of machine learning models in these settings, implementing a number of different fairness definitions (see Barocas et al. 2020; a fairness-metrics sketch follows this list). Quantify the trade-off between fairness and accuracy, and discuss the implications for predictive models in your chosen setting, as well as the limitations of algorithmic fairness approaches there.
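
The sketches below are optional, hedged starting points for a few of the ideas above; all file names, column names, dates, and coordinates are hypothetical placeholders. First, a minimal sketch of spatial cross-validation for the poverty-prediction idea, assuming a table of survey clusters with coordinates, image-derived features, and a measured wealth index:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GroupKFold

# Hypothetical input: one row per survey cluster, with image-derived
# features (f0, f1, ...) plus coordinates and a measured wealth index.
df = pd.read_csv("clusters.csv")  # assumed columns: lat, lon, wealth, f0..f127
features = [c for c in df.columns if c.startswith("f")]

# Group nearby clusters into spatial blocks so that train and test folds
# are geographically separated. Ordinary random k-fold overstates accuracy
# when spatially correlated neighbors land in both folds.
blocks = KMeans(n_clusters=10, random_state=0).fit_predict(df[["lat", "lon"]])

scores = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(df, groups=blocks):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(df.iloc[train_idx][features], df.iloc[train_idx]["wealth"])
    preds = model.predict(df.iloc[test_idx][features])
    scores.append(r2_score(df.iloc[test_idx]["wealth"], preds))

print(f"Spatial CV r^2: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Comparing this score against ordinary k-fold on the same data is one quick way to quantify how much spatial leakage inflates reported accuracy.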
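
For the impact evaluation idea, a minimal sketch of pulling before/after composites with the Google Earth Engine Python API. This assumes you have an Earth Engine account; the site location and date windows are made up:

```python
import ee

ee.Authenticate()  # one-time browser sign-in (requires an Earth Engine account)
ee.Initialize()

# Hypothetical study site: a 5 km buffer around a point (lon, lat).
site = ee.Geometry.Point(39.28, -6.82).buffer(5000)

def cloud_free_composite(start, end):
    """Median Sentinel-2 surface-reflectance composite over the site."""
    return (ee.ImageCollection("COPERNICUS/S2_SR")
            .filterBounds(site)
            .filterDate(start, end)
            .filter(ee.Filter.lt("CLOUDY_PIXEL_PERCENTAGE", 20))
            .median()
            .clip(site))

# Hypothetical pre- and post-intervention windows.
before = cloud_free_composite("2018-01-01", "2018-12-31")
after = cloud_free_composite("2022-01-01", "2022-12-31")

# Export RGB composites to Google Drive for downstream analysis.
for name, img in [("before", before), ("after", after)]:
    ee.batch.Export.image.toDrive(
        image=img.select(["B4", "B3", "B2"]),  # Sentinel-2 RGB bands
        description=f"site_{name}",
        scale=10,  # metres per pixel
        region=site,
    ).start()
```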
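
For the benchmarking idea, a minimal sketch of scoring several gridded data products against survey ground truth, assuming you have already extracted each product's value at every survey cluster:

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Hypothetical inputs: survey clusters with a measured wealth index, and
# per-cluster values extracted from each gridded data product.
survey = pd.read_csv("survey_clusters.csv")   # columns: cluster_id, country, wealth
products = pd.read_csv("product_values.csv")  # columns: cluster_id, product, value

merged = survey.merge(products, on="cluster_id")
for (country, product), g in merged.groupby(["country", "product"]):
    r, _ = pearsonr(g["wealth"], g["value"])     # linear agreement
    rho, _ = spearmanr(g["wealth"], g["value"])  # rank agreement
    print(f"{country:15s} {product:25s} r={r:.2f} rho={rho:.2f} n={len(g)}")
```

Rank correlation is often the more relevant metric if a product would be used to target aid to the poorest places.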
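
For the epidemiological prediction idea, a minimal sketch of a nowcasting model with lagged digital signals, assuming a weekly panel of case counts and search-trend series (the file and query names are invented):

```python
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical input: weekly case counts plus weekly search-trend series.
df = pd.read_csv("weekly.csv", parse_dates=["week"]).set_index("week")
queries = ["fever", "cough", "clinic near me"]  # invented query names

# Lag the digital signals so the model only sees information that would
# actually be available at prediction time.
for q in queries:
    for lag in (1, 2, 3):
        df[f"{q}_lag{lag}"] = df[q].shift(lag)
df = df.dropna()

feature_cols = [c for c in df.columns if "_lag" in c]
split = int(len(df) * 0.7)  # time-ordered split: never shuffle time series
train, test = df.iloc[:split], df.iloc[split:]

model = LassoCV(cv=TimeSeriesSplit(n_splits=5)).fit(
    train[feature_cols], train["cases"])
print(f"Out-of-sample r^2: {model.score(test[feature_cols], test['cases']):.3f}")
```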
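
For the mobility idea, a minimal sketch of a displacement measure from call detail records, using each subscriber's modal cell tower as a crude home-location proxy (the schema and quake date are placeholders):

```python
import pandas as pd

# Hypothetical CDR schema: one row per call, with caller_id, timestamp,
# and the tower_id that handled the call. The quake date is a placeholder.
cdr = pd.read_csv("cdr.csv", parse_dates=["timestamp"])
QUAKE = pd.Timestamp("2020-02-03")

def modal_tower(frame):
    """Most-used tower per subscriber: a crude 'home location' proxy."""
    return frame.groupby("caller_id")["tower_id"].agg(lambda s: s.mode().iloc[0])

home_before = modal_tower(cdr[cdr["timestamp"] < QUAKE])
home_after = modal_tower(cdr[cdr["timestamp"] >= QUAKE])

# Restrict to subscribers active in both periods, so churn (people who
# simply stop calling) isn't mistaken for displacement.
both = home_before.index.intersection(home_after.index)
moved = home_before.loc[both] != home_after.loc[both]
print(f"Share of subscribers with a new modal tower: {moved.mean():.1%}")
```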
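
Finally, for the algorithmic fairness idea, a minimal sketch of two common group-fairness metrics, demographic parity and equal opportunity; the data here is randomly generated purely to show the computation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Random stand-ins for model predictions, true labels, and a group
# attribute (e.g., urban vs. rural); substitute your model's real outputs.
df = pd.DataFrame({
    "y_true": rng.integers(0, 2, 1000),
    "y_pred": rng.integers(0, 2, 1000),
    "group": rng.choice(["urban", "rural"], 1000),
})

# Demographic parity: positive prediction rates should match across groups.
pos_rate = df.groupby("group")["y_pred"].mean()
# Equal opportunity: true-positive rates should match across groups.
tpr = df[df["y_true"] == 1].groupby("group")["y_pred"].mean()

print(f"Demographic parity gap: {pos_rate.max() - pos_rate.min():.3f}")
print(f"Equal opportunity gap:  {tpr.max() - tpr.min():.3f}")
```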

Projects focused more on ethical/social analysis

  • Data-driven development at the community level: Use and adapt existing methods — or develop your own — to envision how a new data-driven approach to development or humanitarian aid could interact in unexpected ways with people, communities, societies, and/or the environment. For example, Cobb et al. (2016) use existing methods of threat modeling to examine the privacy and security threats that could arise from Open Data Kit. Kaulyaalalwa et al. (2023) created a scavenger hunt that youth in Namibia engaged with to guide the city of Windhoek in its efforts to become a 'smart' city. This project could take place in parallel with a short data analysis-focused project to ensure the methods are grounded.
  • Explain data-driven development: Use and adapt existing methods – or develop your own – to explain big data approaches to development to non-technical audiences. For example, Kahn et al. (forthcoming) developed visual aids to explain mobile phone metadata and machine learning to people living in rural villages in Togo. Abebe et al. (2021) leverage storytelling as a way to illustrate concerns related to data sharing practices. This project could take place in parallel with a short data analysis-focused project to ensure the methods are grounded.
  • Explore how a value, method, or discipline that has been developed primarily in the 'west' may look different in a different place or in the context of big data approaches to development: For example, Sambasivan et al. (2021) explore how definitions and instantiations of fairness in the 'west' are inadequate and would need to be adapted to be acceptable in India. Birhane et al. (2022) explore the possibility and potential pitfalls of participatory approaches to AI (with several examples focused on LMICs). Conduct a short data analysis quantifying at least two definitions of the value or method (one from a more 'western' lens, and one that is more situated).
  • Surface "new" values that arise in the context of big data approaches to development: Look beyond the values commonly raised in discussions of responsible tech (e.g., fairness, transparency, accountability, privacy) to surface other values that may be important. For example, Guardia et al. (2022) discuss the importance of social cohesion in targeting cash transfers. Wein (2022) explores dignity in the context of development. Conduct a short data analysis using one value commonly raised in the discussion of responsible tech (e.g., fairness) and begin to chart what data analysis might look like when centering another value (e.g., dignity).
  • Interview domain experts about a particular new use of big data approaches to development, or a particular social or ethical issue. For example, Taylor (2014) conducted interviews to understand the specific ethical concerns that arise from tracking mobility from mobile phone metadata. Conduct a short data analysis related to the interview topic. For example, if you were looking at ethical concerns that arise from tracking mobility from mobile phone metadata, you could use the dataset of partially fabricated mobile phone metadata to analyze a few mobility patterns (contact course staff for access to this dataset). 
  • Interview experiential experts about a particular use of big data approaches to development, or a particular social or ethical issue. For example, Kahn et al. (forthcoming) conducted interviews with people living in rural villages in Togo to understand the data privacy concerns that arise related to the use of mobile phone metadata and machine learning to inform approaches to development policy. While it may not be possible to access people living in LMICs, you could do a pilot study with people nearby — or consider interviewing people who may be part of different communities related to the particular topic (e.g., Togolese living in the United States). Conduct a short data analysis related to the interview topic. For example, if you were looking at ethical concerns that arise from tracking mobility from mobile phone metadata, you could use the dataset of partially fabricated mobile phone metadata to analyze a few mobility patterns (contact course staff for access to this dataset).
  • AI Regulation in LMICs: Many low- and middle-income countries are currently drafting new AI regulations. For example, this article maps the AI regulatory landscape in Africa related to healthcare. One concern that has been raised is that many African countries are adopting AI regulations that were largely developed in the US and EU (GDPR in particular), and are therefore not responsive to local norms. A project could analyze a handful of new AI regulations in LMICs, assessing in what ways these regulations draw on prior regulation developed in the US and EU, and speculate about the places where these new regulations may capture — and fail to capture — local norms. This project could be paired with a short data analysis case study. For example, because satellite imagery and mobile phone metadata are quite different pieces of data, it might be interesting to consider both in an analysis of AI regulation, so you could conduct a short data analysis using both types of data.

Data source ideas

All projects are expected to include a data analysis component. Here is a list of publicly available datasets that may be useful for the project. Students are also welcome to use other data sources. If you find a useful data source, please share it with the class on bCourses discussions!

Survey data

Satellite imagery

Data products derived from satellite imagery

Web and social media data

Climate and environment data

Violence and conflict data

Machine learning benchmark datasets

  • SustainBench – Benchmarks relating to SDGs
  • WILDS – Benchmarking data, including LANDSAT for poverty prediction
  • Wild-Time – Time series benchmarking data, including LANDSAT and healthcare-related tasks

Large repositories of different datasets