DIGITAL EPIDEMIOLOGY

PROGRAM
 

* PART 1 (8 points)
 

1.1 - Consider either a synthetic Barabasi-Albert graph, a real social network (e.g., from the SNAP or KONECT network data repositories), or a sample of a real social network. If the chosen graph has multiple disconnected components, select its largest connected component. Make sure your graph has at least a few thousand nodes. Compute and plot the degree distribution of the graph.
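
A minimal sketch of 1.1, assuming networkx and matplotlib (any graph library works; the Barabasi-Albert parameters below are arbitrary):

    import collections
    import matplotlib.pyplot as plt
    import networkx as nx

    G = nx.barabasi_albert_graph(n=10_000, m=3, seed=42)
    # for a real, possibly disconnected network, keep the largest component:
    # G = G.subgraph(max(nx.connected_components(G), key=len)).copy()

    degree_counts = collections.Counter(d for _, d in G.degree())
    ks, counts = zip(*sorted(degree_counts.items()))
    plt.loglog(ks, counts, marker="o", linestyle="none")  # heavy tails show best on log-log axes
    plt.xlabel("degree k")
    plt.ylabel("number of nodes")
    plt.show()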
 

1.2 - Set up a simulation of an SIR epidemic model on the graph. Choose values of the model parameters β (infection probability) and μ (recovery probability) that allow the epidemic to take off with high probability, reaching most of the graph. Plot the epidemic curve.
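
One possible discrete-time implementation, sketched under the assumption that β and μ are per-time-step probabilities (the values below are placeholders to tune); G is the graph from 1.1, and plotting the returned list gives the epidemic curve:

    import random

    def sir_epidemic_curve(G, beta=0.05, mu=0.02, rng=random):
        status = {v: "S" for v in G}
        status[rng.choice(list(G))] = "I"        # a single random seed node
        curve = []                               # infectious count per step
        while any(s == "I" for s in status.values()):
            infectious = [v for v, s in status.items() if s == "I"]
            for v in infectious:
                for w in G[v]:                   # infection attempts
                    if status[w] == "S" and rng.random() < beta:
                        status[w] = "I"
                if rng.random() < mu:            # recovery attempt
                    status[v] = "R"
            curve.append(sum(s == "I" for s in status.values()))
        return curve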
 

* PART 2 (12 points)
 

2.1 - Select a random set of N nodes of the graph, which we will refer to as "sentinel nodes". Modify the simulation so that you record the arrival time of the epidemic at these nodes (i.e., the discrete time step at which each of those nodes transitions from the susceptible to the infectious state). Compute the distribution of arrival times over many realizations of the epidemic (~hundreds). Display the resulting distribution using a boxplot.
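
A sketch of 2.1: the SIR loop above, modified to also return the time step at which each node became infectious; N = 50 sentinels and 300 realizations are arbitrary choices:

    import random
    import matplotlib.pyplot as plt

    def sir_arrival_times(G, beta=0.05, mu=0.02, rng=random):
        seed = rng.choice(list(G))
        status = {v: "S" for v in G}
        status[seed] = "I"
        t_inf, t = {seed: 0}, 0                  # node -> infection time step
        while any(s == "I" for s in status.values()):
            t += 1
            for v in [u for u, s in status.items() if s == "I"]:
                for w in G[v]:
                    if status[w] == "S" and rng.random() < beta:
                        status[w], t_inf[w] = "I", t
                if rng.random() < mu:
                    status[v] = "R"
        return t_inf

    sentinels = set(random.sample(list(G), 50))
    arrivals = [t for _ in range(300)
                for v, t in sir_arrival_times(G).items() if v in sentinels]
    plt.boxplot(arrivals)
    plt.ylabel("arrival time at sentinels (time steps)")
    plt.show()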
 

2.2 - Consider many random choices of the N sentinel nodes, as well as many epidemic realizations (as done above), and recompute the above distribution of arrival times (so that, this time, it also takes into account the randomness due to the random selection of sentinel nodes).
 

2.3 - Compute a few centrality metrics of your choice for all the nodes of the network (e.g., degree centrality, betweenness centrality, PageRank centrality, etc.). For each of these metrics, compute a global ranking of the nodes and choose as sentinels the top-N and bottom-N nodes of the ranking. Re-run the simulations of points 2.1 and 2.2 and see how the resulting distribution of arrival times is affected. Build a figure with one boxplot for each strategy for choosing the sentinel nodes (e.g., "random" vs "top-degree" vs ...). Discuss your results.
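
A sketch of the rankings, assuming networkx; betweenness is approximated by sampling (the k value is arbitrary) because the exact algorithm is slow on graphs of this size. Each sentinel set can then be passed to the arrival-time simulation of 2.1 and drawn as one boxplot per strategy:

    import random
    import networkx as nx

    metrics = {
        "degree": nx.degree_centrality(G),
        "pagerank": nx.pagerank(G),
        "betweenness": nx.betweenness_centrality(G, k=500),  # sampled
    }
    N = 50
    sentinel_sets = {"random": set(random.sample(list(G), N))}
    for name, scores in metrics.items():
        ranking = sorted(G, key=scores.get, reverse=True)    # global ranking
        sentinel_sets["top-" + name] = set(ranking[:N])
        sentinel_sets["bottom-" + name] = set(ranking[-N:])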
 

* PART 3 (6 points)
 

3.1 - Select N nodes at random as sentinels. Imagine that you do not have global information about the graph, and that you only have information about the graph neighborhoods of the chosen nodes. That is, you know the identity of the randomly chosen nodes, the identity of their neighbors, and the edges among the latter. Starting from the initial random set of N nodes, can you build another set of N nodes that improves (i.e., reduces) the average detection time of the epidemic?
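
One candidate strategy, sketched purely as an illustration (not the only valid answer): move each sentinel to a randomly chosen neighbor, which uses only the neighborhood information allowed above:

    import random

    def neighbor_sentinels(G, N=50, rng=random):
        initial = rng.sample(list(G), N)
        # replace each node by a random neighbor (isolated nodes stay put);
        # duplicates may arise if two sentinels move to the same hub
        return [rng.choice(list(G[v])) if len(G[v]) > 0 else v
                for v in initial]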
 

3.2 - If you have solved 3.1, explain why your strategy works.
 

* BONUS tasks (total bonus 8 points max)
 

- working with a graph larger than 1 million nodes (5 points)
 

- code length shorter than 100 lines (language of your choice), excluding empty lines, comments, import statements, and code that prints results or plots figures (3 points)

- nicely commented notebook with accurately discussed results (3 points)

- timely completion of the assignment by the due date, November 21st (2 points)
 

NOTES:

- you can use any programming language(s) you like

- a single self-contained Jupyter Notebook is recommended but not compulsory and does not give you extra points

- use your preferred tools to solve the assignment!

References to course materials (1):
 

* INTRODUCTION
 

The goal of this assignment is to try to reproduce for Italy

the results reported in this article for the USA:
 

[1] D.J. McIver & J. S. Brownstein (2014), "Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time", PLoS Comput Biol 10(4): e1003581

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.10...

You will need:

1) OFFICIAL DATA ON INFLUENZA IN ITALY. The Italian health protection agency

runs a flu surveillance program called "Influnet" that uses sentinel doctors.

The project is described here: http://www.iss.it/iflu/index.php?lang=1&anno=2016&tipo=4

You can find the data in the Dropbox folder of the course,

under "Assignments/influnet".

The official data are reported in PDF files (see the "PDF" folders for examples)

but a digitized version of the data is available for you under "data".

You have one file for every flu season.

You only need the 1st and 5th columns of each file, which are,

respectively, the week of the year (https://en.wikipedia.org/wiki/ISO_week_date)

and the estimated flu incidence for that week (i.e., the fraction of the population

corresponding to new weekly cases; watch out: the decimal separator is a comma instead of a period).

These data will be your ground truth.
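
A minimal loading sketch with pandas; the file name is illustrative, decimal="," handles the comma decimal separator mentioned above, and the separator/skiprows may need adjusting to the actual files:

    import pandas as pd

    season = pd.read_csv("data/2016-2017.csv", decimal=",")
    ground_truth = season.iloc[:, [0, 4]]        # 1st and 5th columns
    ground_truth.columns = ["week", "incidence"]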

 

2) WIKIPEDIA PAGE VIEW DATA. The Wikimedia Foundation makes available several

datasets, tools and APIs to work with page view data. A summary can be found here:

https://en.wikipedia.org/wiki/Wikipedia:Pageview_statistics

As an example, here you can see the "Influenza" and "Febbre" (fever)

page view counts for the Italian Wikipedia:

http://bit.ly/2gpU5RK

The raw data are available here (some changes in format/location happened recently):

- until August 2016: https://dumps.wikimedia.org/other/pagecounts-raw/

- after August 2016: https://dumps.wikimedia.org/other/pageviews/

There is one compressed (gzip) file per hour (UTC time).

Within a file, every row reports the number of hourly views for every page,

according to the format:

"it Influenza 8 0"

You only need the first three columns: the first is the language edition of Wikipedia

you are interested in: you need to consider only the rows that have "it" or "it.m" here

("it.m" is the mobile version of the Wikipedia site). The 2nd column contains

the name of the Wikipedia page ("Influenza" in the Italian Wikipedia in this case),

and the 3rd column is the number of page views (8 in this case)

for that page during the hour the file corresponds to.

You need to aggregate this data at the weekly scale,

so that you can compare it with the official data described above.

We are interested in the Italian Wikipedia only, so select only rows

that begin with "it" or "it.m" and sum the pageview values for each page of interest.
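
A sketch of scanning one hourly dump (the file name is illustrative) and keeping only the "it"/"it.m" rows for a chosen set of pages:

    import gzip
    from collections import Counter

    PAGES = {"Influenza", "Febbre"}
    views = Counter()
    with gzip.open("pageviews-20161121-120000.gz", "rt",
                   encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.split(" ")
            if len(fields) < 3:
                continue
            lang, page, count = fields[0], fields[1], fields[2]
            if lang in ("it", "it.m") and page in PAGES:
                views[page] += int(count)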

You can also use third-party tools, such as this,

http://bit.ly/2g0vPoA

that do most of the work for you and allow you to download

weekly pageview data in CSV format for the pages you choose

(if you work on the raw data, however, you'll be able to compute

more interesting things and you will get extra points for this assignment).
 

* PART 1 (10 points)

1.1 - Process the Wikipedia pageview data for the "Influenza" page of the Italian

Wikipedia (https://it.wikipedia.org/wiki/Influenza), aggregate the pageviews

on a weekly time scale, and plot the resulting time series of page views

for the current year and - ideally - also for previous years.
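
For the weekly aggregation, one option is pandas' ISO-calendar support (pandas >= 1.1); the hourly series below is a stand-in, with one entry per hourly dump file:

    import pandas as pd

    hourly = pd.Series(
        [8, 5, 11],                              # views per hourly file
        index=pd.to_datetime(["2016-11-21 12:00", "2016-11-21 13:00",
                              "2016-11-28 09:00"], utc=True),
    )
    iso = hourly.index.isocalendar()             # ISO year / week / day
    weekly = hourly.groupby([iso.year, iso.week]).sum()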

 

1.2 - Compare the time series from the official Influnet surveillance system

with the time series of pageviews obtained in 1.1.

Compute some measure of correlation between the two time series.

Is the correlation significant?
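
A sketch with scipy, assuming the two weekly series were aligned on the ISO week into the ground_truth table from the sketch above and a wiki_weekly table with columns "week" and "views" (an assumed name):

    from scipy.stats import pearsonr

    aligned = ground_truth.merge(wiki_weekly, on="week")
    r, p = pearsonr(aligned["incidence"], aligned["views"])
    print("Pearson r = %.2f, p-value = %.3g" % (r, p))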

 

* PART 2 (10 points)

2.1 - Try to find other Wikipedia pages related to flu whose pageview time series

are correlated with the Influnet signal. For example, you could consider

the pages linked from the "Influenza" page, such as symptom pages:

- https://it.wikipedia.org/wiki/Febbre

- https://it.wikipedia.org/wiki/Rinorrea

- https://it.wikipedia.org/wiki/Mialgia

- https://it.wikipedia.org/wiki/Cefalea

- https://it.wikipedia.org/wiki/Vomito

pages corresponding to medications for fever:

- https://it.wikipedia.org/wiki/Paracetamolo

and pages devoted to flu vaccines:

- https://it.wikipedia.org/wiki/Vaccino_antinfluenzale#Vaccino_influenza_s...

Use any strategy you think is appropriate to choose these pages.

Compute their weekly pageview time series for the last year

and - if possible - for the previous years,

and plot them together with the Influnet signal as in 1.1.

2.2 - When aggregating the hourly pageview data to compute weekly pageviews,
can you do something to select page views coming from Italy?

 

2.3 - For each of the selected Wikipedia pages, compute the same correlation

with the Influnet time series that you computed in 1.2.

Which of these correlations is significant?

Did you discover a better page than "Influenza"

in terms of correlation with the ground truth?

* PART 3 (15 points)

3.1 - Build a regression model that predicts the Influnet incidence

for a given week based on the Wikipedia pageview data for the same week.

Your features are the Wikipedia pageview counts for the "Influenza" page,

for all the pages you have selected in Part 2,

and for any other page that you think might help (there are probably

global trends that have nothing to do with influenza,

and there might be ways to control for them in your model.)

Carry out any feature selection you think is appropriate.

Evaluate the performance of your model via cross-validation.
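
A sketch of 3.1 with scikit-learn (an assumption; any toolkit works), where X holds one row per week with the pageview features and y holds the Influnet incidence; TimeSeriesSplit keeps the folds in chronological order:

    from sklearn.linear_model import Ridge
    from sklearn.model_selection import TimeSeriesSplit, cross_val_score

    model = Ridge(alpha=1.0)                     # regularization strength is a choice
    scores = cross_val_score(model, X, y,
                             cv=TimeSeriesSplit(n_splits=5), scoring="r2")
    print(scores.mean(), scores.std())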

3.2 - Add these features to your model:
- the Influnet incidence for the week preceding the target week
- the pageview counts for all the pages you selected for the week preceding the target week
Re-train your model and evaluate its performance via cross-validation.
Did it improve?
How does it compare with the results reported in article [1]?
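
The lagged features are one shift() away, assuming a DataFrame df with one row per week, an "incidence" column, and one column per selected page (selected_pages is an assumed list of column names):

    df["incidence_prev"] = df["incidence"].shift(1)
    for page in selected_pages:
        df[page + "_prev"] = df[page].shift(1)
    df = df.dropna()                             # the first week has no lag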

* BONUS tasks (total bonus 10 points max)

- working with raw Wikipedia pageview data (4 points)

- nicely commented notebook with accurately discussed results (3 points)

- timely completion of the assignment by the due date, December 16th (3 points)

==> NOTES <==

- you need to code the assignment in Python

- a single self-contained Jupyter/IPython Notebook is required

- for space reasons, you don't have to copy the raw data into your Dropbox folder

(we will not re-execute your Notebook in most cases)

References to course materials (2):

* Introduction to epidemiology (4 hours)

- history and evolution of epidemiology

- theoretical epidemiology, applied epidemiology, digital epidemiology

- core epidemiologic functions: public health surveillance, field investigation, analytic studies, evaluation and linkage, policy development
 

Materials: course slides + CDC self-study course SS198 book:

https://www.cdc.gov/ophss/csels/dsepd/ss1978/index.html

https://www.cdc.gov/ophss/csels/dsepd/ss1978/SS1978.pdf

* Statistical methods in epidemiology (6 hours)

- descriptive epidemiology vs analytic epidemiology

- observational studies: cohort studies, cross-sectional studies, case-control studies

- disease occurrence and progression, natural history of a disease

- epidemic diseases, stages of an epidemic outbreak

- interventions

Materials: same as above

* Mathematical models for infectious diseases (10 hours)

- basic concepts in network science and social network analysis

- degree distributions, measures of centrality, clusters and communities

- features of real social networks and contact networks

- generative graph models

- compartmental models

- epidemic processes on networks, epidemic threshold, reproductive number

- network heterogeneity & epidemic thresholds

- SIR / SIS / SEIR epidemic models

- metapopulation models & spatial models
 

Materials: course slides

+ M.J. Keeling & P. Rohani, "Modeling Infectious Diseases in Humans and Animals", Princeton University Press

+ D. Easley & J. Kleinberg, "Networks, Crowds, and Markets", Cambridge University Press
 

* Social networks & epidemiology (6 hours)

- homophily, social selection, social influence, latent homophily

- misinformation, opinion dynamics, echo chambers, vaccine adherence

- computational social science & epidemiology

Materials: same as above

* Digital disease detection (6 hours)

- patterns for digital disease detection: proxy data, ground truth data sources, machine learning models

- the case of Google Flu Trends

- influenza prevalence from search engine queries (autoregressive model)

- influenza prevalence from Wikipedia page view data (generalized linear model)

- influenza prevalence from symptom mentions in Twitter (SVR model)

- vaccination coverage from sentiment analysis & geocoding of Twitter messages, opinions and polarization in relation to social network structure

- obesity surveillance using location-based social network data

- pharmacovigilance: using search engine queries to discover adverse drug interactions

Materials: course slides

+ M. Salathé et al. (2012) Digital Epidemiology. PLoS Comput Biol 8(7): e1002616

+ 8-10 research articles reporting on the chosen case studies

* Surveillance (4 hours)

- outbreak investigation systems: Healthmap, Promed, etc.

- the architecture of Healthmap

- syndromic surveillance using Web-based cohorts, early warning sentinels

- influenza prevalence from Web-based self-reported syndromic data

- validation
 

Material: course slides

+ 2-3 research articles reporting on the chosen case studies

* next-generation data sources (4 hours)

- state of the art of sensors and wearables

- human contact networks from smartphone data, CDR data, wearable devices

- the future of clinical and physiological sensing

Material: course slides

Exam format:

two assignments, one at mid-course (digepi_assignment_1.txt, attached)

and one at the end of the course (digepi_assignment_2.txt, attached).

Assessment follows the criteria specified in each assignment.

Detailed, individual correction of the first assignment in class.

Individual correction of the second assignment via email.