Recommending quiet paths in Shanghai: a Data Science point of view

By Guillaume Salha on 13 Oct 2015

Walking woman in city

Walking in town. Have you already felt that feeling of frustration or oppression caused by a stressful environment? Air pollution, noise or crowd are all sources of stress able to spoil your walking activity. All the while walking is essential for healthy life as much as for the economic and social life. The city of Shanghai (上海), with more than 23.5 million official citizens in 2012, does not escape this reality.

However, most maps and path suggestions apps do not take into account these aspects. When you look for a path between two places, you are only proposed the shortest path between these two places. The goal of this study was to use data science to propose quality paths for pedestrians within Shanghai: for example the most quiet path, the most beautiful path or the most touristic path.

State of the art and data

In order to determine a scientific methodology to obtain such quality paths, we studied previous research works. We were in particular interested in the results obtained by Quercia, Schifanella and Aiello (2014) [10] within Yahoo’s laboratory in Barcelona. They worked on a very close urban paths recommendation problem for the city of London. They were themselves inspired by Van Canneyt and al. (2011) [6] and Peregrino and al. (2012) [8]. Cheng et al. (2011) [7] as well as Yoon and al. (2010, 2012) [5][9] also provided interesting approaches. We will not discuss each article in detail here, but you can check this very interesting TED talk: Happy Maps by Daniele Quercia in 2014.

Besides, we had at our disposal a large variety of geolocalized data. We retrieved data from Chinese social networks Weibo (messages posted by users in a geolocalized place of Shanghai), Dianping (rankings of geolocalized Points Of Interests such as restaurants), and from the image hosting website FlickR (titles, hashtags and geolocalizations of photos). We also obtained environmental data from the Data Canvas project such as measures of air quality, sound level and humidity. Finally, we had the entire road network of Shanghai, coming from OpenStreetMap coming with a handful of characteristics for each road : number of road lanes, presence of a park or a forest nearby, etc.

Step 1 — Graph Analysis

First we abstracted the road network of Shanghai as a graph using Python and namely NetworkX library. We studied this graph where edges correspond to roads from a statistical point of view. For instance we derived the edge betweenness centrality and used it to infer the intensity of road traffic on these edges. We also associated to each road all characteristics coming from our data. We thus knew whether roads were near a natural place such as a park or a forest, or whether they were pedestrian-only roads.

a shanghai road network graph analysis

Step 3 — Social Media Quiet Positions

We also grouped all FlickR photos and Weibo messages posted near these roads. To determine the closest road for any given latitude/longitude of social media message or photo — in other words, to determine the influence area of each road — we used the concept of Voronoi diagram for line segments. The C++ Boost.Polygon Voronoi implementation of Fortune’s sweep line algorithm especially helped us to construct the diagram on our road network.

Step 4 — Language Analysis

From this point, it was possible for us to construct a quality of walk indicator for roads having enough messages or photos. We decided to use FlickR and Weibo keywords as signals of the road’s quietness quality. The idea is simple: a road is considered as quiet if photos and messages posted nearby contain keywords linked to quietness (in the title or hashtags for photos). We thus proposed an algorithm whose goal is to build such a quality indicator, and we applied it to the research of quiet roads.

This algorithm, coded in Python, takes into account various potential problems. Indeed, how should we take into account the fact that some roads are more frequented than others, and thus than the occurrence of three photos nearby a very isolated road will be a stronger signal than three photos nearby a very busy road? Moreover, beyond the obvious keywords (calm, quiet, peaceful), how should we determine the relevant keywords linked to the idea of quietness? We tried to deal with these problems, obtaining interesting results.

For instance, here are some English keywords related to quietness, according to our algorithm: buddha, calm, chinesegarden, garden, green, meditation, nature, nightwalk, park, peace, peaceful, plants, prayer, quiet, reflexions, tranquil, tree, water. At the same time, we retrieved English keywords opposed to quietness: agitated, cars, circulation, crazy, crowd, crowded, driving, horde, nanjingroad, noisy, people, pollution, shoppingdiscrict, scooter, tired, traffic, unpeaceful, upset.

Of course, we also encountered a lot of Chinese words, coming especially from Weibo data. Text mining in Chinese is complicated because, contrary to English, words (i.e. characters or groups of characters) are not separated by spaces. In order to do text segmentation, one can use for instance the Python natural language toolkit Jieba, based on Hidden Markov Chains.

Step 5 — Modeling Quietness of a Road

This approach allowed us to derive an indicator value for 22.6% of roads. What should be done for the remaining 77.4%, which don’t have enough messages and photos nearby? We developed different ways to model the quality of walk, with an application to quiet paths. The objective was double. First we wanted to understand which variables have an impact on the quality of walk on a road. Second we wanted to be able to predict the value of the indicator for the remaining roads, when it was not possible to directly build it due to lack of data.

We started from a classical multiple linear regression model and went to more advanced methods from machine learning such as random forest. In both cases, we learned on a training set established by roads already having a value for the quietness indicator. The explained variable was of course this indicator and the explanatory variables were the set of roads’ characteristics obtained from our data. We used the R software for that analysis.

Models led to interesting and coherent interpretations. We showed in particular that, other things being equal, roads crossing natural zones are more quiet than the others and that, when the number of road lanes increases, the quietness decreases. We measured these impacts. Variables corresponding to the environmental measures also have a role to play, like road traffic’s intensity and evaluation of the district’s decoration, coming from Dianping. We also showed, using in particular error measures on a test set and ROC curves, that the random forest approach gives us the best predictions: we thus used random forest to make predictions on the remaining 77.4% of roads without quietness indicator.

Step 6 — Visualize Optimal Paths

This work allowed us to now be in possession of a weighted graph, each road having a quietness “grade”. We wished to determine the most quiet walking path between two places of the road network. This is an optimal weighted path finding problem between two nodes of a graph. This is a typical problem in graph theory that one can face using Dijkstra algorithm or Eppstein algorithm.

In our work, we also took into account the important question of the length-quality trade-off. We know that the most quiet path will be, by construction, longer than the shortest path; nevertheless, this path not should be too long and go through the entire city of Shanghai before connecting two points. Once again, the interested reader could refer to our entire report, or to Quercia, Schifanella and Aiello (2014) work based on marginal value theory. In our case, optimal paths can not be 20% longer than shortest paths.

Finally, we propose a visualization of the optimal walking paths on a map. An example is given on the next picture, using CartoDB: the red path is the shortest path and the green path is the most quiet path. We are currently developing an app that will allow pedestrians to directly choose two points on a map and automatically obtain optimal paths between them. They will also retrieve a measure of quality improvement and length. A link will soon be posted on ComplexCity Lab website. A performance evaluation of this final application, by real users, is also in progress. This evaluation is essential to check whether our algorithm really returns quiet paths in practice.

a shanghai alternative quiet path suggestion

Possible Improvements

Our algorithm gives us interesting and coherent results. We are now able to compare paths in term of quietness, to explain which variables have an impact on this quietness, to recommend quiet paths between two places in Shanghai and to visualize the final result on a map. We can also easily adapt the algorithm to different machine learning methods, or to find other kinds of roads namely the most beautiful roads and the most touristic roads

Nevertheless, we have to keep in mind that there is a lot of possible improvements. We tried to have a scientific approach throughout all the analysis. Limits are explained in the full report (contact us). They concern both the available data and the methods. We noticed for example that the environmental data from Data Canvas were not precise, that the road network from OpenStreetMap is probably not totally complete and that we should try to redo web scraping to collect more data (for instance, we only had Weibo data from a part of 2012, we should try a longer period). We also presented a whole series of possible extensions, potentially intended for another study on this topic. With more data, we could have built models taking into account temporal factors. Indeed, the quietness of a road may vary over days or hours. We could also try to replicate our methodology on other cities, and compare the results with Shanghai.

Bibliography

  1. G. Debord, Introduction to a Critique of Urban Geography. LLN 6, 1955
  2. E. Dijkstra, A note on two problems in connexion with graphs. Numerische Mathematik 1: 269-271, 1959
  3. S. Fortune, A sweepline algorithm for Voronoi diagrams. Proceedings of the second annual symposium on Computational Geometry, New York, United States, pp.313-322. 1986
  4. A. De Botton. The Architecture of Happiness. Vintage Series, Knopf Doubleday Publishing Group, 2008
  5. H. Yoon, Y. Zheng, X. Xie, W. Woo, Smart itinerary recommendation based on user-generated GPS trajectories. 7th Conference on Ubiquitous Intelligence and Computing (UIC), 2010
  6. S. Van Canneyt, S. Schockaert, O. Van Laere, B. Dhoedt, Time-dependent recommendation of tourist attractions using Flickr. Belgian/Netherlands Artificial Intelligence Conference (BNAIC), 2011
  7. A.-J. Cheng, Y.-Y. Chen, Y.-T. Huang, W. H. Hsu, H.-Y. M. Liao, Personalized travel recommendation by mining people attributes from community-contributed photos. 9th ACM Int. Conference on Multimedia (MM), 2011
  8. F. S. Peregrino, D. Tomas, P. Clough, F. Llopis, Mapping routes of sentiments. Spanish Conference on Information Retrieval, 2012
  9. H. Yoon, Y. Zheng, X. Xie, W. Woo, Social itinerary recommendation from user-generated digital trails. Personal and Ubiquitous Computing, 16(5), 2012
  10. D. Quercia, R. Schifanella, L. M. Aiello, The shortest path to happiness: recommending beautiful, quiet, and happy routes in the city. Yahoo Labs Barcelona, 2014
  11. M. Zavershynskyi, Higher-Order Voronoi Diagrams of Polygonal Objects, PhD Thesis, USI INF, 2014
Guillaume Salha

Guillaume is a Data Scientist, he was intern at ComplexCity during summer 2015

comments powered by Disqus