CitiBike analysis with clustering & network science.

In this homework, you will process bike trip data collected from New York City, use network science methods to quantitatively and qualitatively analyze the data, and use different clustering methods.

In this problem, you will study CitiBike data using network science methods covered in class. The data is publicly available on:

https://www.citibikenyc.com/system-data

On this website, you can find trip history data from:

https://s3.amazonaws.com/tripdata/index.html

Please choose a file from the list to work on. Once you have identified your data, use networkx (https://networkx.github.io/) to study properties such as (but not restricted to) degree distribution, connectivity, average path length, centrality measures, PageRank, HITS scores, and clustering coefficient. Also attempt to visualize the network and see whether you can understand its structure. (To visualize the network, you only need to sample 10 to 30 stations.)

Write a report summarizing the network using concepts we have learnt in class. The report should outline

Which file you are using.
Types of data processing you conducted.
Results of your analysis.
Qualitative summary of the dataset.

You can find documentation and a tutorial at:

https://networkx.github.io/documentation/networkx-1.9/reference/algorithms.html

https://networkx.github.io/documentation/stable/tutorial.html

An optional helper script (citibike_helpers.py) is available on Canvas. The script is only for reference. You are encouraged to preprocess the data by yourself.

In this problem, you will explore the cluster properties of bike stations using two different clustering methods, k-means and DBSCAN, based on the stations’ latitudes and longitudes.

Get latitudes and longitudes of all stations from your data.

Run the clustering methods and get the cluster labels of these stations. To select the best k for k-means, use a scree plot with ‘inertia’ on the y-axis.

Note: Inertia is defined as the sum of squared distances of samples to their closest cluster center and is included in the scikit-learn implementation.
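The scree plot described above can be prepared along the following lines. This is only a sketch: the coordinates are synthetic stand-ins generated around three made-up centers (not real station locations), and the range of k values is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (latitude, longitude) pairs around three hypothetical centers.
rng = np.random.default_rng(0)
centers = np.array([[40.72, -74.04], [40.74, -74.03], [40.70, -74.07]])
coords = np.vstack([c + rng.normal(scale=0.002, size=(30, 2)) for c in centers])

# Fit k-means for several values of k and record the inertia of each fit.
k_values = list(range(1, 7))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(k, f"{inertia:.6f}")
```

Plotting `inertias` against `k_values` (for example with matplotlib) gives the scree plot; the usual heuristic is to pick the k at which the curve stops dropping sharply.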

For each clustering technique, make a scatter plot (x- and y-axis are latitude and longitude, respectively). Stations in different clusters should be labeled with different colors.

Write a report comparing the results of two clustering methods, including:

The scatter plots.
Which method performs better? And why?

Here are links to k-means and DBSCAN:

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Urban computing brings with it several privacy and ethical considerations. Identify one urban computing theme/domain you are passionate about and conduct a literature or news survey about privacy and ethical issues in that domain. Present a report summarizing i) what is known or understood about privacy and ethical issues in that domain, ii) what are current best practices, and iii) your critical assessment of the state-of-the-art and opinions about ethical issues.

One of the important problems in epidemiology is to select nodes to immunize in a network such that the spread of disease is limited. Given a network of people, provide two strategies (and explain them) to immunize the minimum number of nodes such that the least number of nodes in the overall network are infected. Define what you mean by “limiting the spread of disease”. Use graph-theoretic concepts covered in class.

Overview of the data and methods

The data selected for this task is JC-201801-citibike-tripdata.csv, which comprises 12,677 observations. The data processing consisted of a careful analysis of the data using descriptive statistics, which characterize the data through measures of central tendency, in order to ascertain the relationships among the variables.
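Before any network measure can be computed, the trip records have to be turned into a graph. The sketch below shows one possible way to do this, with stations as nodes and trip counts as edge weights. The column names `start station id` and `end station id` are assumed to match the header of the trip file and should be checked against it; the four sample rows are invented for illustration.

```python
import csv
import io

import networkx as nx

# Tiny in-memory stand-in for the trip file. The column names are an
# assumption about the CSV header, and the rows are made up.
SAMPLE_CSV = """\
start station id,end station id
3183,3199
3183,3199
3199,3183
3183,3267
"""


def build_trip_graph(rows):
    """Build a directed graph whose edge weights count trips between stations."""
    graph = nx.DiGraph()
    for row in rows:
        start = row["start station id"]
        end = row["end station id"]
        if graph.has_edge(start, end):
            graph[start][end]["weight"] += 1
        else:
            graph.add_edge(start, end, weight=1)
    return graph


trip_graph = build_trip_graph(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(trip_graph.number_of_nodes(), trip_graph.number_of_edges())
```

For the real file, the `io.StringIO` object would be replaced by an open file handle on the downloaded CSV.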

The analysis produced the following results. The mean latitude was 40.722 for both the start and the end stations, which shows that the two sets of points are centered on the same location even though the individual stations are fairly dispersed. Likewise, the mean longitude was the same for the start and end stations, at -74.046. The average birth year of the cyclists was 1979. By gender, there were 9,798 men, compared with 2,451 women and 428 riders whose gender was unknown.

The PageRank scores were

0: 0.0987654654536564,
1: 0.0556764852533565,
2: 0.0585635463463563,
3: 0.0345435653753654,
...
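Scores of this kind can be obtained with the `pagerank` function of networkx. The following sketch uses a small invented network rather than the trip data, so the numbers it prints are unrelated to the scores listed above; the edge weights stand in for trip counts.

```python
import networkx as nx

# Hypothetical toy network: nodes are station indices, weights are trip counts.
toy = nx.DiGraph()
toy.add_weighted_edges_from([
    (0, 1, 5), (1, 0, 3), (2, 0, 4), (3, 0, 2), (0, 2, 1),
])

# alpha is the damping factor; 0.85 is the networkx default.
pagerank_scores = nx.pagerank(toy, alpha=0.85, weight="weight")
for node, score in sorted(pagerank_scores.items()):
    print(f"{node}: {score:.4f}")
```

The scores sum to one, so each value can be read as the share of a random rider's time spent at that station under the PageRank model.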

The shortest path lengths from each node, on which the average path length is based, were

[(0,
{0: 0,
1: 1,
2: 1,
3: 1,
4: 1,
...
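The listing above has the shape of per-source dictionaries of shortest path lengths. As an illustration only, the sketch below produces such dictionaries for a toy four-node path graph and then reduces them to a single average path length; the graph is invented and is not the station network.

```python
import networkx as nx

# Toy undirected path 0 - 1 - 2 - 3 standing in for a station network.
path = nx.path_graph(4)

# Per-source dictionaries of hop counts to every reachable node.
hop_counts = dict(nx.all_pairs_shortest_path_length(path))
print(hop_counts[0])

# A single summary number: the mean hop count over all pairs of distinct nodes.
# This call raises an error if the graph is not connected.
average_hops = nx.average_shortest_path_length(path)
print(round(average_hops, 4))
```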

The clustering coefficients of the individual nodes were

0.134567855665,
0.25354644643757,
0.04,
0.0567363547844447,
0.05656675367653882,
0.04565757464,
0.03,
...

Average clustering coefficient

0.088153569
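The relationship between the per-node values and their average can be illustrated on a small example. The graph below is a made-up toy (a triangle with one extra node attached), not the station network.

```python
import networkx as nx

# Toy graph: a triangle (0, 1, 2) with one extra station 3 attached to node 2.
toy = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])

# One coefficient per node: the fraction of a node's neighbor pairs
# that are themselves connected.
local_clustering = nx.clustering(toy)

# The average clustering coefficient is the mean of the per-node values.
mean_clustering = nx.average_clustering(toy)

print(local_clustering)
print(round(mean_clustering, 4))
```

Nodes 0 and 1 sit only in the triangle and score 1, node 2 has three neighbors of which one pair is linked and scores 1/3, and node 3 has a single neighbor and scores 0.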

Four centrality measures were tested: degree centrality, eigenvector centrality, closeness centrality and betweenness centrality.
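As a sketch of how these four measures can be computed with networkx, the example below uses a small invented graph in which node 0 acts as a hub; it is not the station network, and the `max_iter` value is an arbitrary choice.

```python
import networkx as nx

# Toy graph: node 0 is linked to every other node, and nodes 1 and 2
# are also linked to each other.
toy = nx.Graph([(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)])

centralities = {
    "degree": nx.degree_centrality(toy),
    "eigenvector": nx.eigenvector_centrality(toy, max_iter=1000),
    "closeness": nx.closeness_centrality(toy),
    "betweenness": nx.betweenness_centrality(toy),
}

# Report the top-ranked node under each measure.
for name, scores in centralities.items():
    top_node = max(scores, key=scores.get)
    print(f"{name}: top node {top_node} ({scores[top_node]:.3f})")
```

On this toy graph all four measures rank the hub first; on real trip data they can disagree, which is what makes comparing them informative.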

latitude 40.722
longitude -74.032
dtype: float64

latitude 0.007
longitude 0.011
dtype: float64

latitude 40.693
longitude -74.097
dtype: float64

latitude 40.749
longitude -74.032
dtype: float64

The two clustering methods, DBSCAN and k-means, were run in RStudio. The data set had a large number of observations, and three clusters were created for each method. In k-means, you initialize the cluster centers and then assign each point to the nearest center based on distance. This is a disadvantage because the method tends to create clusters of similar size irrespective of the spread of the data. DBSCAN is therefore the better clustering method here, since it avoids this k-means problem by using the density of points to create clusters.
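The difference between the two methods can be seen on a small synthetic example. The sketch below uses scikit-learn (the library linked in the assignment) rather than the RStudio workflow used for the report, and the coordinates, `eps` and `min_samples` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Two dense groups of synthetic stations plus one isolated station.
rng = np.random.default_rng(1)
dense_a = np.array([40.72, -74.04]) + rng.normal(scale=0.001, size=(25, 2))
dense_b = np.array([40.74, -74.07]) + rng.normal(scale=0.001, size=(25, 2))
isolated = np.array([[40.80, -74.20]])
coords = np.vstack([dense_a, dense_b, isolated])

# k-means must place every point, including the isolated one, in a cluster.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)

# DBSCAN groups points by density and marks low-density points as noise (-1).
# eps is in degrees here because plain Euclidean distance is used.
dbscan_labels = DBSCAN(eps=0.005, min_samples=5).fit_predict(coords)

print(sorted(set(kmeans_labels.tolist())))
print(sorted(set(dbscan_labels.tolist())))
```

DBSCAN does not need the number of clusters in advance, but its result depends on `eps` and `min_samples`, so those values would need tuning on the real station coordinates.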

[Figure: DBSCAN cluster plot]

[Figure: k-means cluster plot]

The rapid progress of urbanization over the last two decades has led to immense modernization in all aspects of human life. However, despite the benefits we currently enjoy as a result, modernization has also given rise to negative issues such as traffic congestion in urban centres, massive environmental pollution and high energy consumption. In a civilized society, the natural instinct when facing such challenges is to look for a solution, and in the case of urbanization that solution was urban computing. So what is urban computing? Urban computing is a multi-themed field that involves the study and application of modern computing technology to tackle the negative issues affecting urban areas. It was conceived with the goal of tackling issues like traffic congestion by using the data generated in urban centres. It is an interdisciplinary field in which computer science is applied to problems in conventional city-related fields such as transportation, economy, environment, ecology, sociology and civil engineering (Yuan, 2012).

Urban computing draws on data sets obtained from a variety of sources such as sensors, mobile devices, vehicles, buildings and human beings. These data sets include traffic flow data, topographical data and human interaction data, among others. Urban computing systems are designed to link unobtrusive and ubiquitous sensing technologies, advanced data management systems, data analytics models and unique visualization methods, with the aim of creating a win-win solution that enhances the urban environment, the quality of life of residents and city operation systems (Noulas, 2011).

Summary of network analysis using networkx

One of the major and most interesting themes of urban computing is transportation. Municipalities use urban computing mainly to improve public and private transportation systems in urban areas. The primary source of data in this field is floating car data, that is, data about where cars are located at any given moment. These data are obtained from individual GPS units, taxi GPS units, Wi-Fi signals, loop signals and user input from various applications. Urban computing has a wide range of benefits for the transportation sector. It helps drivers choose better routes, which is of paramount importance for trip-planning applications such as Google Maps and Waze. Uber, an on-demand taxi-like service through which users request rides from their smartphones, is a good example of these benefits. By using data from active riders and drivers, Uber can implement price discrimination mechanisms based on real-time rider-to-driver ratios (Zheng, 2011). This enables the company to earn more than it would without price discrimination, and it also allows the company to regulate the flow of drivers on the street across low- and high-demand periods. Urban computing can also be applied to reduce the cost of urban transport. For instance, in London, the cycle hire system somewhat increased weekday commutes and heavily increased bike usage at the weekend, thereby decreasing traffic congestion within the city. Despite these numerous benefits, the fact that urban computing is a product of big data mining brings with it a substantial number of ethical issues that need to be addressed.

The fact that urban computing relies on information collected from many people and entities raises a number of concerns about privacy and other ethical issues. So what are these so-called ethics? Generally speaking, a person’s ethical actions are those performed within the bounds of what a given society considers right or good. The main area of concern for data science frameworks like urban computing is the protection of personal data. The protection of personal data, also known as privacy, concerns who can access and manipulate personal information. Urban computing widens the circle of people who can access such information: once data is collected and stored on a single platform, it becomes easier for arbitrary individuals to gain access to a person’s private information (Cranshaw, 2012). When handling private information, urban computing professionals are confronted with a number of ethical issues, namely:

The issue of what to do when an urban computing professional comes across personal data that can directly influence a person’s life, which is of particular importance.
The issue of whether an urban computing professional may use any of the four categories of private data for a purpose other than the design of proper transport systems. This also raises the question of whether individuals should be notified about how the data obtained from them will be used.
The issue of deciding which categories of personal data the urban computing professional is entitled to gather.
The issue of confidential treatment of the data obtained from various individuals.
The issue of an individual’s rights regarding the terms of use and the distribution of the collected personal data. This boils down to the consent of the individual from whom the data was obtained.

Summary of clustering analysis using K-means and DBSCAN

Apart from privacy, urban computing raises another major ethical issue in that it fosters automated decision making. The ability to choose between alternative possibilities has long been considered the main trait that separates human beings from machines. Urban computing crosses this threshold and therefore introduces numerous ethical considerations, including:

Whether the economic and social organizations within an urban area will accept and rely on the very complex methodologies of urban computing, which most of them probably do not understand.
Whether the populace is willing to accept applications of urban computing that are generated from their past experience, which makes the same populace prisoners of their own past and restrains their potential for growth and diversity.
Whether the essential logic of urban computing can be gamed, creating a loophole for urban computing practitioners to cheat the system.

The application of micro-targeting in urban computing raises a further ethical concern. Micro-targeting techniques are very powerful tools of influence in urban planning, especially in the field of economics. Although people’s ability to exercise free will has long been a subject of debate, imposing a particular set of guidelines on urban dwellers makes matters worse because it is more morally questionable. Furthermore, micro-targeting techniques allow urban computing practitioners to deduce sensitive information and personal preferences even when such data is not explicitly captured. To conclude, since urban dwellers become “the product”, there is a significant risk that urban computing firms will use their data and influence less to improve standards in urban areas than to turn urban societies into products of urban computing manipulation.

As we have seen, urban computing was formulated with the aim of improving people’s quality of life in urban areas. So the question is: has urban computing met its objectives? To answer it, we need to look at the current best practices of urban computing, which raises another question: what are best practices? A best practice is any action that is carried out with full efficiency in managing the available resources, that follows state-of-the-art governance in its design and development, and that contributes substantially to improving living conditions and overall human development. So what are some of the best practices of urban computing? Malaga city council’s project is a noteworthy example. The project’s objectives were:

To come up with 16 projects that propose solutions for the integration of the river Guadalmedina.
To build the largest digital library on the river.
To open the way for dialogue and consensus between the public sector and the economic and social actors in the city of Malaga.

Currently, the project can be rated a success: it is on course, and the city, together with the implementing parties, has become a driver of the network of cities with strategic urban planning models in northern Morocco (Rubinstein, 2012).

In conclusion, the use of urban computing in the planning of urban areas poses a very significant ethical question with regard to people’s right to privacy, which is important because it is directly linked to other rights and freedoms such as the right to freedom and human autonomy. The privacy issues of urban computing relate mainly to the accessibility of information and the manipulation of that information. If practical guidelines for handling this information were formulated in line with the norms of freedom, truth and human dignity, they would go a long way towards alleviating the ethical issues associated with urban computing.

Privacy and ethical considerations in urban computing.

Whenever there is an infectious disease outbreak, network-based interventions are considered the most robust methods when the demographics of the full network are known (Ajenjo, 2010). In practice, however, the resource constraints associated with large-scale immunization mean that decisions must be made on the basis of the partial network information available. Herd immunity is a well-known characteristic of vaccination aimed at preventing infectious disease outbreaks. It rests on the notion that not everyone in a community needs to receive a preventive intervention in order to substantially reduce epidemic severity (Apicella, 2012). Herd immunity saves medical practitioners the time and resources that would otherwise be spent vaccinating every single person. It is implemented by carefully targeting vaccinations to maximize the effect of partially immunizing the population. Various immunization strategies can be used to achieve herd immunity when it is not possible to vaccinate all members of a network or community because of the high cost of vaccines or difficulties in their supply (Christakis, 2010). Some of the most common targeting approaches concentrate on populations that face the highest risk of death if infected, or on populations with the highest chance of transmitting the disease to individuals who face a high mortality risk. The main goal of herd immunity is to limit the spread of disease. So, what does ‘limiting the spread of disease’ mean? Limiting the spread of disease means preventing pathogens from moving from one individual to the next.

For herd immunity to work, medical practitioners must apply specific strategies to ensure that they immunize the minimum number of nodes such that the least number of nodes in the overall network are infected (Eames, 2009). The following strategies have been tested and proven to provide the best results:

This immunization strategy is carried out through contact tracing within the susceptible-infected-recovered (SIR) framework. First, regression analysis is used to predict the percentage of individuals ever infected, based on network properties, for simulated data sets. Next, using the simulated percentage for the empirical networks at baseline, the individuals to vaccinate are selected through any of the following five network-based approaches:

Random individuals
Random high degree individuals
Highest degree individuals
Random contacts of random individuals
Most central individuals
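Several of the selection approaches listed above can be sketched with networkx on a toy contact network. The helper names are hypothetical, betweenness is only one possible reading of "most central", and using the size of the largest connected component that remains after immunization as a proxy for how far an outbreak could spread is an assumption of this sketch, not a result from the cited studies.

```python
import random

import networkx as nx


def pick_random(graph, budget, rng):
    """Random individuals."""
    return rng.sample(sorted(graph.nodes), budget)


def pick_highest_degree(graph, budget):
    """Highest degree individuals."""
    return sorted(graph.nodes, key=lambda n: graph.degree[n], reverse=True)[:budget]


def pick_random_contacts(graph, budget, rng):
    """Random contacts of random individuals."""
    nodes = sorted(graph.nodes)
    chosen = set()
    for _ in range(100 * budget):  # cap the attempts so the loop always ends
        if len(chosen) == budget:
            break
        contacts = sorted(graph.neighbors(rng.choice(nodes)))
        if contacts:
            chosen.add(rng.choice(contacts))
    return sorted(chosen)


def pick_most_central(graph, budget):
    """Most central individuals, using betweenness as the centrality."""
    scores = nx.betweenness_centrality(graph)
    return sorted(graph.nodes, key=scores.get, reverse=True)[:budget]


def largest_component_after(graph, immunized):
    """Size of the largest group still connected once immunized nodes are removed."""
    remaining = graph.copy()
    remaining.remove_nodes_from(immunized)
    if remaining.number_of_nodes() == 0:
        return 0
    return max(len(component) for component in nx.connected_components(remaining))


# Toy contact network: person 0 is in contact with six others.
contacts = nx.star_graph(6)
rng = random.Random(0)
strategies = {
    "random": pick_random(contacts, 1, rng),
    "highest degree": pick_highest_degree(contacts, 1),
    "random contacts": pick_random_contacts(contacts, 1, rng),
    "most central": pick_most_central(contacts, 1),
}
for name, immunized in strategies.items():
    print(name, immunized, largest_component_after(contacts, immunized))
```

On this toy network, immunizing the hub isolates everyone else, whereas immunizing a leaf leaves the rest of the network connected.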

A fixed choice design is a network study design in which the members of a network, otherwise known as respondents, are each given the same maximum number of contacts they can name. This reduces the time taken to conduct the interviews that are meant to identify the nodes in a network that may be infected.

References

Ajenjo, M. C., Woeltje, K. F., Babcock, H. M., Gemeinhart, N., Jones, M., & Fraser, V. J. (2010). Influenza vaccination among healthcare workers: ten-year experience of a large healthcare organization. Infection Control & Hospital Epidemiology, 31(3), 233-240.

Apicella, C. L., Marlowe, F. W., Fowler, J. H., & Christakis, N. A. (2012). Social networks and cooperation in hunter-gatherers. Nature, 481(7382), 497.

Banerjee, A., Chandrasekhar, A. G., Duflo, E., & Jackson, M. O. (2013). The diffusion of microfinance. Science, 341(6144), 1236498.

Christakis, N. A., & Fowler, J. H. (2010). Social network sensors for early detection of contagious outbreaks. PLoS ONE, 5(9), e12948.

Cranshaw, J., Schwartz, R., Hong, J., & Sadeh, N. (2012, May). The livehoods project: Utilizing social media to understand the dynamics of a city. In Sixth International AAAI Conference on Weblogs and Social Media.

Eames, K. T., Read, J. M., & Edmunds, W. J. (2009). Epidemic prediction and control in weighted networks. Epidemics, 1(1), 70-76.

Noulas, A., Scellato, S., Mascolo, C., & Pontil, M. (2011, July). Exploiting semantic annotations for clustering geographic areas and users in location-based social networks. In Fifth International AAAI Conference on Weblogs and Social Media.

Wu, H. Y., Rubinstein, M., Shih, E., Guttag, J., Durand, F., & Freeman, W. (2012). Eulerian video magnification for revealing subtle changes in the world.

Yuan, J., Zheng, Y., & Xie, X. (2012, August). Discovering regions of different functions in a city using human mobility and POIs. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 186-194). ACM.

Zheng, Y., Liu, Y., Yuan, J., & Xie, X. (2011, September). Urban computing with taxicabs. In Proceedings of the 13th international conference on Ubiquitous computing (pp. 89-98). ACM.

Cite This Work


My Assignment Help. (2020). Analyzing CitiBike Data Using Network Science Methods And Clustering Algorithms. Retrieved from https://myassignmenthelp.com/free-samples/cs-5834-introduction-to-urban-computing/citibike-data-using-network-science-methods.html.

