# CS5834 Introduction to Urban Computing

• Course Code: CS 5834
• University: Victoria University
• Country: Australia

## Question:

In this homework, you will process bike trip data collected from New York City, use network science methods to quantitatively and qualitatively analyze the data, and use different clustering methods.

### Problem 1. Networks

In this problem, you will study CitiBike data using network science methods covered in class. The data is publicly available on:

https://www.citibikenyc.com/system-data

On this website, you can find trip history data from:

https://s3.amazonaws.com/tripdata/index.html

Please choose a file from the list to work on. Once you have identified your data, use networkx (https://networkx.github.io/) to study properties such as (but not restricted to) degree distribution, connectivity, average path length, centrality measures, PageRank, HITS scores, and clustering coefficient. Also attempt to visualize the network and see whether you can understand its structure. (To visualize the network, you only need to sample 10 to 30 stations.)
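As a starting point, the analysis above can be sketched as follows. This is a minimal sketch, not a reference solution: the `trips` list is hypothetical toy data standing in for the (start station, end station) pairs you would extract from your chosen file, and modeling stations as nodes of a weighted directed graph is one possible design choice among several.

```python
import networkx as nx

# Hypothetical toy data: (start_station, end_station) pairs.
# In practice these would be read from the trip history file you chose.
trips = [
    ("A", "B"), ("A", "B"), ("B", "C"), ("C", "A"),
    ("C", "D"), ("D", "A"), ("B", "D"),
]


def build_trip_graph(trips):
    """Build a directed graph whose edge weights count trips between stations."""
    G = nx.DiGraph()
    for start, end in trips:
        if start == end:
            continue  # assumption: round trips to the same station are ignored
        if G.has_edge(start, end):
            G[start][end]["weight"] += 1
        else:
            G.add_edge(start, end, weight=1)
    return G


G = build_trip_graph(trips)

# Degree information (the basis of a degree distribution).
in_degrees = dict(G.in_degree())
out_degrees = dict(G.out_degree())

# Connectivity; average path length is only defined on a strongly connected digraph.
strongly_connected = nx.is_strongly_connected(G)
avg_path = nx.average_shortest_path_length(G) if strongly_connected else None

# Centrality-style measures.
pagerank = nx.pagerank(G, weight="weight")
betweenness = nx.betweenness_centrality(G)

# Clustering coefficient, computed here on the undirected version of the graph.
clustering = nx.average_clustering(G.to_undirected())

print(in_degrees, strongly_connected, avg_path, clustering)
```

On real data the graph may not be strongly connected, in which case one option is to restrict the path-length computation to the largest strongly connected component.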

Write a report summarizing the network using concepts we have learnt in class. The report should outline

• Which file you are using.
• Types of data processing you conducted.
• Qualitative summary of the dataset.

You can find the documentation and a tutorial at:

https://networkx.github.io/documentation/networkx-1.9/reference/algorithms.html

https://networkx.github.io/documentation/stable/tutorial.html

An optional helper script (citibike_helpers.py) is available on Canvas. The script is only for reference. You are encouraged to preprocess the data by yourself.

### Problem 2. Clustering

In this problem, you will explore the cluster properties of bike stations using two different clustering methods, k-means and DBSCAN, applied to the stations' latitudes and longitudes.

1. Get latitudes and longitudes of all stations from your data.
2. Run clustering methods and get cluster labels of these stations. To select the best k for k-means, use a scree plot with ‘inertia’ in the y-axis.

Note: Inertia is defined as the sum of squared distances of samples to their closest cluster center and is included in the scikit-learn implementation.

3. For each clustering technique, make a scatter plot (x- and y-axis are latitude and longitude, respectively). Stations in different clusters should be labeled with different colors.
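The steps above can be sketched as follows. This is only an illustration under stated assumptions: the coordinates are synthetic points generated around three hypothetical centers rather than real station locations, and the DBSCAN parameters (`eps=0.01`, `min_samples=5`) and the final `n_clusters=3` are arbitrary choices for this toy data, not recommended values.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Synthetic (latitude, longitude) points around three hypothetical centers.
# Real coordinates would come from the station columns of your trip file.
rng = np.random.default_rng(0)
centers = np.array([[40.70, -74.00], [40.75, -73.98], [40.68, -73.95]])
coords = np.vstack([c + rng.normal(scale=0.003, size=(30, 2)) for c in centers])

# Inertia for a range of k; plotting k against inertia gives the scree plot.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
    inertias[k] = model.inertia_

# Cluster labels from each method (k=3 is assumed here for the toy data).
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)

# DBSCAN marks noise points with the label -1.
dbscan_labels = DBSCAN(eps=0.01, min_samples=5).fit_predict(coords)

for k, value in inertias.items():
    print(k, round(value, 6))
```

The label arrays can then be passed as the color argument of a scatter plot. Note that this sketch treats degrees of latitude and longitude as plain Euclidean coordinates, which is a simplification; whether that is acceptable for your data is worth discussing in the report.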

Write a report comparing the results of the two clustering methods, including:

1. The scatter plots.
2. Which method performs better, and why?

Here are links to k-means and DBSCAN:

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

### Problem 3. Urban Computing and Ethics

Urban computing brings with it several privacy and ethical considerations. Identify one urban computing theme/domain you are passionate about and conduct a literature or news survey about privacy and ethical issues in that domain. Present a report summarizing i) what is known or understood about privacy and ethical issues in that domain, ii) what are current best practices, and iii) your critical assessment of the state-of-the-art and opinions about ethical issues.

### Problem 4. Immunization

One of the important problems in epidemiology is to select nodes to immunize in a network such that the spread of disease is limited. Given a network of people, provide two strategies (and explain them) to immunize the minimum number of nodes such that the fewest nodes in the overall network are infected. Define what you mean by “limiting the spread of disease”. Use graph-theoretic concepts covered in class.
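To make a strategy comparison concrete, one possible way to evaluate a set of immunized nodes is sketched below. Everything here is an assumption made for illustration: the contact network is a synthetic Barabási-Albert graph, the budget of 20 nodes is arbitrary, and "limiting the spread" is approximated by the size of the largest connected component left after the immunized nodes are removed. Other definitions are equally valid, and the two selection rules shown (highest degree and uniform random) are examples rather than the expected answer.

```python
import random

import networkx as nx


def largest_component_after_removal(G, immunized):
    """Size of the largest connected component once immunized nodes are removed."""
    H = G.copy()
    H.remove_nodes_from(immunized)
    if H.number_of_nodes() == 0:
        return 0
    return max(len(component) for component in nx.connected_components(H))


# Synthetic contact network (assumption: stands in for a real network of people).
G = nx.barabasi_albert_graph(200, 2, seed=1)
budget = 20  # hypothetical number of nodes that can be immunized

# Example rule 1: immunize the nodes with the highest degree.
by_degree = sorted(G.degree(), key=lambda pair: pair[1], reverse=True)
targeted = [node for node, _ in by_degree[:budget]]

# Example rule 2: immunize nodes chosen uniformly at random (a baseline).
rng = random.Random(1)
randomly_chosen = rng.sample(sorted(G.nodes()), budget)

targeted_size = largest_component_after_removal(G, targeted)
random_size = largest_component_after_removal(G, randomly_chosen)

print("largest component, targeted:", targeted_size)
print("largest component, random:", random_size)
```

The same helper can score any other selection rule, for example one based on betweenness centrality or PageRank, so that different strategies are compared under a single stated definition.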

