Urban Computing and Big Data

<div class="iteye-blog-content-contain" style="font-size: 14px;">
Basic framework and core issues of urban computing
Basic framework 
The basic framework of urban computing includes urban sensing and data acquisition, data management, urban data analysis, and service delivery (Figure 1). Compared with "single-data, single-task" systems such as natural language analysis and image processing, urban computing is a "multi-data, multi-task" system. Its tasks include improving urban planning, alleviating traffic congestion, protecting the natural environment, and reducing energy consumption, and a single task often needs several kinds of data at once. For example, urban planning design has to draw on road structure, the distribution of points of interest, and traffic flow at the same time.

Urban sensing. How to use a city's existing resources (such as mobile phones, sensors, vehicles, and people) to perceive the rhythm of the city automatically is an important research topic. Collecting and transmitting data efficiently and reliably from a large number of sensors and devices challenges existing sensor network technology. In addition, having people take part in urban sensing as sensors is a new concept. For example, when a disaster strikes, some users post messages or upload photos on social networks; these users are in effect sensing what is happening around them. Users swiping their cards when entering and leaving subway stations also indirectly helps us sense congestion in the subway system and people's travel patterns. People acting as sensors bring powerful sensing capability and unprecedented flexibility compared with traditional sensors, but the data they generate is more random and disordered (such as text on Weibo), and the time at which it is generated is unpredictable and uncontrollable, which poses challenges for data collection and parsing.
 
Management of massive heterogeneous data. The data generated by cities comes in many forms whose properties differ widely. For example, weather is time series data, points of interest are spatial point data, roads are spatial graph data, human movement is trajectory data (time + space), traffic flow is stream data, and information posted by users on social networks is text or image data. How to manage and integrate large-scale heterogeneous data is a new challenge. In particular, when an application uses several kinds of data, the subsequent analysis and mining become efficient and feasible only if the associations between the different data are established in advance.
 
Collaborative computing on heterogeneous data. This part has three aspects. (1) How to learn mutually reinforcing knowledge from different data sources is a new subject. Traditional machine learning is often based on a single kind of data: natural language processing mainly analyzes text, and computer vision mainly works on images. In many urban computing applications, treating data of different natures in the same way does not give good results. (2) While preserving the depth of knowledge extraction, improving the efficiency of big data analysis so as to serve the many urban computing applications with strict real-time requirements (such as air quality prediction and abnormal event monitoring) is also a hard problem. (3) Increasing data dimensionality easily leads to data sparsity, and once the data reaches a certain scale, even simple matrix factorization algorithms become difficult to run.
 
Hybrid systems. Urban computing often produces hybrid systems such as the cloud-plus-terminal model: information is generated in the physical world, collected by terminal devices and sent to the cloud (the virtual world) for analysis and processing, and finally the cloud delivers the extracted knowledge as a service to end users in the physical world. Data travels back and forth between the physical and virtual worlds, from decentralized to centralized and back to decentralized, which places higher demands on system design and construction. Fastest driving route design based on floating car data and the monitoring of urban abnormal events are typical hybrid systems.
 
Urban Planning
Urban congestion partly reflects the fact that the existing road network design can no longer meet the needs of evolving urban traffic flows. As shown in Figure 2(a), the city is divided into regions by main roads such as expressways and ring roads, and features of the large-scale trajectory data for traffic traveling between regions are then analyzed to find pairs of regions with poor connectivity, thereby exposing shortcomings in the existing road network. Figure 2(b) shows the analysis results based on 3 months of trajectory data from more than 30,000 taxis in Beijing. These results can serve as a reference for the next version of the transportation plan. By comparing the results for two consecutive years, one can also check whether plans that have already been implemented (such as new roads and subway lines) were reasonable.
<img style="text-align: center; display: block;" title="Urban Computing and Big Data" src="http://s11.sinaimg.cn/mw690/4caedc7agx6C1UGkkae0a&690" alt="Urban Computing and Big Data" width="490" height="277">
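The region-level analysis described above can be sketched roughly as follows. This is a minimal illustration and not the method of the study behind Figure 2: the trip record layout, the region labels, and the `speed_threshold_kmh` and `min_trips` parameters are all assumptions made for the example. It flags region pairs that carry several trips but have a low effective speed (straight-line distance divided by travel time), which is one simple proxy for poor connectivity.

```python
from collections import defaultdict

def poorly_connected_pairs(trips, speed_threshold_kmh=15.0, min_trips=2):
    """trips: iterable of (origin_region, dest_region, straight_line_km, travel_hours).

    All field names and thresholds are hypothetical, chosen for illustration.
    """
    stats = defaultdict(lambda: [0, 0.0, 0.0])  # trip count, total km, total hours
    for origin, dest, km, hours in trips:
        entry = stats[(origin, dest)]
        entry[0] += 1
        entry[1] += km
        entry[2] += hours

    flagged = []
    for pair, (count, km, hours) in stats.items():
        if count >= min_trips and hours > 0:
            speed = km / hours  # effective speed between the two regions
            if speed < speed_threshold_kmh:
                flagged.append((pair, count, round(speed, 1)))
    # Busiest pairs first: heavy demand plus low speed is the interesting case.
    return sorted(flagged, key=lambda item: -item[1])

trips = [
    ("A", "B", 5.0, 0.5),
    ("A", "B", 6.0, 0.5),
    ("A", "C", 8.0, 0.25),
    ("A", "C", 8.0, 0.25),
]
print(poorly_connected_pairs(trips))
```

Running the same function on two consecutive years of data and comparing the flagged pairs would be one way to check whether a new road or subway line helped.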
As a city develops, it forms different functional areas such as cultural and educational, commercial, and residential areas. An accurate picture of how these areas are distributed matters a great deal for sound urban planning. But the function of an area is rarely single: a cultural and educational area, for example, still contains restaurants and commercial facilities. An area therefore needs to be described by a distribution over functions (e.g., 70% commercial, 20% residential, and the rest education). Because an area mixes many types of points of interest, and the role and visit frequency of each point of interest are hard to predict, this is a great challenge for urban planning. For example, a small neighborhood eatery and a large restaurant such as Quanjude are both restaurants, yet the regional functions they reflect are completely different.
 
Combining POI data with people's movement patterns, the paper <a href="http://research.microsoft.com/apps/pubs/default.aspx?id=163746">Discovering regions of different functions in a city using human mobility and POIs</a> identifies different functional areas in a city. As shown in Figure 3(a), regions of the same color have the same functional distribution (e.g., the red regions are mainly science, culture and education areas). The human mobility data used in the figure is extracted from taxi trajectories, which record where passengers get on and off. Human mobility data distinguishes well between popular and unpopular POIs of the same category, and it can also reveal the function of an area. For example, if most people leave an area around 8 am and return around 7 pm, the area is likely to be residential. Even when the main function of an area is culture and education, not every location in the area serves that function. So, given a function, we want to know where its core area is. Figure 3(b) shows the core area of a mature business district: the darker the color, the higher the probability that the location belongs to a mature business district.
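To make the idea of "a distribution over functions" concrete, here is a deliberately simple sketch. It is not the topic-model approach of the cited paper; it only weights each point of interest by a visit count (a stand-in for popularity derived from mobility data) and normalizes per category. The category names and visit counts are made-up illustration values.

```python
from collections import Counter

def function_distribution(pois):
    """pois: iterable of (category, visit_count) for a single region.

    Returns the share of each category, weighted by visits, so a busy
    landmark restaurant counts for more than a small neighborhood shop.
    """
    weights = Counter()
    for category, visits in pois:
        weights[category] += visits
    total = sum(weights.values())
    if total == 0:
        return {}
    return {category: weight / total for category, weight in weights.items()}

region_pois = [
    ("commercial", 500),   # a large, popular store
    ("commercial", 200),
    ("residential", 200),
    ("education", 100),
]
dist = function_distribution(region_pois)
print(dist)
```

With these hypothetical numbers the region comes out as 70% commercial, 20% residential, and 10% education, matching the kind of description used in the text.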

Difficulty getting a taxi is a problem many big cities face. By analyzing the pickup and drop-off records of taxi passengers, T-Finder provides a two-way recommendation service for drivers and passengers. On the one hand, the system suggests locations where taxi drivers can look for passengers. By driving toward these locations, drivers can pick up passengers in the shortest time (on the way or at the recommended location) and maximize revenue. On the other hand, as shown in Figure 5(a), the system recommends nearby road segments to passengers where the probability of finding a vacant taxi is higher (different colors indicate different probabilities; blue is the highest and red is the lowest). T-Finder can also predict the number of vacant taxis that will enter nearby taxi stands in the next half hour. Through these recommendations T-Finder can ease the difficulty of getting a taxi during off-peak hours, but it cannot really solve the problem during peak hours. T-Share addresses this with real-time dynamic taxi ridesharing. In the T-Share system, users submit taxi requests from their mobile phones, specifying the pickup and drop-off locations, the number of passengers, and the expected arrival time at the destination. The backend maintains the status of all taxis in real time. When a request arrives, it searches for the optimal taxi that satisfies both the new user's conditions and those of the passengers already on board. Optimal here means the smallest increase in travel distance for the taxi to serve the new user. As shown in Figure 5(b), the taxi is scheduled to pick up u1 and u2 in turn, drop off u1, pick up u3, then drop off u2, and finally drop off u3 (+ means getting on, - means getting off).
According to the simulation results, the T-Share system can save Beijing 800 million liters of gasoline a year (enough for 1 million vehicles for 10 months, worth 1 billion yuan, and equivalent to a reduction of 1.6 billion kilograms of carbon dioxide emissions). Passengers are 3 times more likely to get a taxi while paying 7% less, and taxi drivers earn 10% more.
 
<img style="text-align: center; display: block;" title="Urban Computing and Big Data" src="/admin/blogs/" alt="Urban Computing and Big Data" width="490" height="293">
 Figure 5 Taxi solutions in urban computing
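The "minimum added mileage" criterion used for ridesharing can be illustrated with a small brute-force sketch. This is not the T-Share implementation: it ignores seat capacity, time windows, and the road network (distances are straight lines between planar coordinates), and all names and the data layout are assumptions for the example. For each taxi it tries every position at which the new pickup and drop-off could be inserted into the existing schedule, always keeping the pickup before the drop-off, and picks the taxi whose best insertion adds the least distance.

```python
import math

def route_length(points):
    """Total straight-line length of a sequence of (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def best_insertion(position, schedule, pickup, dropoff):
    """Return (added_distance, new_schedule) for the cheapest insertion."""
    base = route_length([position] + schedule)
    best = None
    for i in range(len(schedule) + 1):          # where the pickup goes
        for j in range(i, len(schedule) + 1):   # drop-off never precedes pickup
            candidate = schedule[:i] + [pickup] + schedule[i:j] + [dropoff] + schedule[j:]
            added = route_length([position] + candidate) - base
            if best is None or added < best[0]:
                best = (added, candidate)
    return best

def choose_taxi(taxis, pickup, dropoff):
    """taxis: dict taxi_id -> (current_position, list of remaining stops)."""
    best_id, best_plan = None, None
    for taxi_id, (position, schedule) in taxis.items():
        plan = best_insertion(position, schedule, pickup, dropoff)
        if best_plan is None or plan[0] < best_plan[0]:
            best_id, best_plan = taxi_id, plan
    return best_id, best_plan

taxis = {
    "t1": ((0, 0), [(10, 0)]),  # already carrying a passenger to (10, 0)
    "t2": ((0, 5), []),         # vacant
}
taxi_id, (added, schedule) = choose_taxi(taxis, pickup=(2, 0), dropoff=(8, 0))
print(taxi_id, added, schedule)
```

In this toy case the occupied taxi wins because the new trip lies exactly on its existing route. A real system would also need a spatio-temporal search step to avoid scanning every taxi, as the text notes later.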
 
Other research uses passengers' card-swipe data from the subway system to estimate the degree of crowding inside individual stations and the travel time between stations, so as to help people choose better routes, departure times, and ticket types. Some work proposes bus routes by analyzing taxi trajectory data: if a large number of people take taxis from one place to another, the two places probably need a bus line to connect them.
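As a rough illustration of how card-swipe records can yield station-to-station travel times, the sketch below pairs each card's entry swipe with its next exit swipe and takes the median duration per station pair. The record layout `(card_id, station, minute, kind)` is an assumption for the example, and the resulting durations include time spent walking and waiting inside stations, not only riding time.

```python
from collections import defaultdict
from statistics import median

def station_travel_times(swipes):
    """swipes: iterable of (card_id, station, minute, kind), kind is 'in' or 'out'.

    Returns {(origin_station, destination_station): median minutes}.
    """
    open_entries = {}                 # card_id -> (station, minute) of last entry
    durations = defaultdict(list)
    for card_id, station, minute, kind in sorted(swipes, key=lambda s: s[2]):
        if kind == "in":
            open_entries[card_id] = (station, minute)
        elif kind == "out" and card_id in open_entries:
            origin, start = open_entries.pop(card_id)
            durations[(origin, station)].append(minute - start)
    return {pair: median(times) for pair, times in durations.items()}

swipes = [
    ("c1", "A", 0, "in"), ("c1", "B", 20, "out"),
    ("c2", "A", 5, "in"), ("c2", "B", 29, "out"),
    ("c3", "A", 10, "in"), ("c3", "B", 32, "out"),
]
print(station_travel_times(swipes))
```

Counting the entries per station and time slot from the same records would give a first estimate of crowding.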
 
Environment
Air quality information is of great significance for controlling pollution and protecting people's health. Many cities have begun to sense ground-level air quality in real time by building air monitoring stations. However, because monitoring stations are expensive to build, a city has only a limited number of them and they cannot cover the whole city. As shown in Figure 6(a), there are only 22 air monitoring stations in the urban area of Beijing (on average, one station per roughly 100 square kilometers). Yet air quality is affected by many factors (such as surface vegetation, traffic flow, and building density) and varies unevenly from place to place. If an area has no monitoring station, we do not know how good or bad its air quality is, and a single overall figure certainly cannot summarize the air conditions of the whole city.
<img style="text-align: center; display: block;" title="Urban Computing and Big Data" src="http://s2.sinaimg.cn/mw690/4caedc7agx6C1UNeqR321&690" alt="Urban Computing and Big Data" width="490" height="276">
Crowd sensing is one way to address this problem. For example, the "Copenhagen Wheel" project installs sensors in bicycle wheels and sends the collected data to a backend server through the user's mobile phone. By relying on the crowd, we can sense the temperature, humidity, and carbon dioxide concentration in different corners of the whole city. Because of limits on sensor size and sensing time, this method is only suitable for some gases, such as carbon monoxide and carbon dioxide. For suspended particles such as fine particulate matter (PM2.5), the sensors are bulky and inconvenient to carry, and they need 2 to 4 hours of measurement to produce reasonably accurate data.
 
U-Air combines the limited air quality readings from ground monitoring stations with big data such as traffic flow, road structure, the distribution of points of interest, meteorological conditions, and human mobility patterns. It uses machine learning algorithms to learn a mapping between these data and air quality, and from that infers fine-grained air quality for the entire city. Figure 6(b) shows the fine-grained air quality in Beijing at a certain moment (different colors represent different pollution indices; green is excellent).
 
Social & Entertainment
The prevalence of social networks, especially location-based social networks, has produced rich media data such as user relationship graphs, location information (check-ins and trajectories), photos, and videos. These data reflect not only personal preferences and habits but also the lifestyle and movement patterns of people across the city. Many recommendation systems have been built on these data, including friend recommendation, community recommendation, location recommendation, travel itinerary recommendation, and activity recommendation. The paper <a href="http://research.microsoft.com/apps/pubs/?id=191797">A survey on recommendations in location-based social networks</a> reviews the various recommender systems in social networks.
 
Social applications in urban computing place greater emphasis on extracting collective intelligence from the social media data of large numbers of users. People taking part in the computation as important sensing and computing units is one of the defining characteristics of urban computing. For example, a user's check-ins or landmark photos can be regarded as an uncertain trajectory, since the user does not check in or take pictures continuously. Given one such trajectory, we cannot determine the specific route the user took, as shown in Figure 7(a). However, when we superimpose the uncertain trajectories of many users, we can infer the most probable route, as shown in Figure 7(b); that is, "uncertain + uncertain → certain". Such applications can help people plan their travel itineraries. For example, if a user wants to visit Houhai, the Temple of Heaven, and the Summer Palace in one trip, entering these three places into the system yields the most popular route computed from the public's check-in data.
<img style="text-align: center; display: block;" title="Urban Computing and Big Data" src="http://s16.sinaimg.cn/mw690/4caedc7agx6C1UNjAfR1f&690" alt="Urban Computing and Big Data" width="490" height="225">
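The "uncertain + uncertain → certain" idea can be illustrated with a toy sketch. It is not the algorithm behind Figure 7: it only counts how often users check in at one place directly after another and then, for a handful of requested places, picks the visiting order best supported by those counts. The check-in sequences are invented, and the brute-force search over permutations is only practical for a few places.

```python
from collections import Counter
from itertools import permutations

def transition_counts(user_checkins):
    """Count consecutive (from_place, to_place) pairs over all users' check-in sequences."""
    counts = Counter()
    for sequence in user_checkins:
        for a, b in zip(sequence, sequence[1:]):
            counts[(a, b)] += 1
    return counts

def most_popular_route(places, counts):
    """Return the ordering of `places` whose consecutive hops were seen most often."""
    def support(order):
        return sum(counts[(a, b)] for a, b in zip(order, order[1:]))
    return max(permutations(places), key=support)

checkins = [
    ["Houhai", "Temple of Heaven", "Summer Palace"],
    ["Houhai", "Temple of Heaven"],
    ["Temple of Heaven", "Summer Palace"],
    ["Summer Palace", "Houhai"],
]
counts = transition_counts(checkins)
route = most_popular_route(["Summer Palace", "Houhai", "Temple of Heaven"], counts)
print(route)
```

Each individual sequence here is incomplete, but the aggregated counts still single out one ordering.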
Social media data also contributes to other aspects of urban computing. For example, information posted by users on social networks has been used to predict the results of presidential elections, the spread of diseases, and housing price trends, to detect abnormal events and disasters, to analyze traffic flow, and to design targeted advertising and choose business locations. Social media data can also be used to analyze the style of a city and the similarities between different cities.
 
Energy consumption
The paper <a href="http://research.microsoft.com/apps/pubs/?id=196236">Sensing the pulse of urban refueling behavior</a> uses the time that GPS-equipped taxis spend waiting at gas stations to estimate each station's queue length, and from that the number of vehicles at the station and the amount of fuel being dispensed at that time. Aggregating these estimates over the gas stations in the city gives the amount of fuel consumed (that is, added to vehicles' fuel tanks) at any time. These data can be used in three ways: first, to recommend to drivers who need to refuel the gas station with the shortest queuing time; second, to let gas station operators know the refueling demand in each area, so that they can consider building new stations or dynamically adjusting the opening hours of some stations; third, to let the government track the fuel consumption of the entire city in real time and formulate a more reasonable energy strategy, as shown in Figure 8.
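A back-of-the-envelope version of the chain "waiting time → queue length → fuel dispensed" might look like the sketch below. It is not the model of the cited paper. It assumes that vehicles are served first come, first served, that every pump takes about the same time per vehicle, and that each vehicle takes on about the same amount of fuel; the function names and all numbers are hypothetical.

```python
def vehicles_ahead(wait_minutes, service_minutes, pumps):
    """Estimate how many vehicles were served while an observed taxi waited.

    With `pumps` pumps each finishing one vehicle every `service_minutes`,
    about wait_minutes * pumps / service_minutes vehicles clear the queue
    before the taxi reaches a pump.
    """
    return wait_minutes * pumps / service_minutes

def fuel_dispensed(vehicles, litres_per_vehicle):
    """Fuel added to tanks, assuming a fixed average fill per vehicle."""
    return vehicles * litres_per_vehicle

queue = vehicles_ahead(wait_minutes=12, service_minutes=4, pumps=2)
litres = fuel_dispensed(queue, litres_per_vehicle=30)
print(queue, litres)
```

Summing such estimates over all stations and time slots gives a citywide consumption figure; stations that no taxi visited in a given slot have no estimate at all, which is the sparsity issue discussed later in the text.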
Economy
Urban economics is a relatively mature field of study. It analyzes, for example, the factors that determine land prices, the impact of land use restrictions on the economy, and the impact of where companies locate and where people choose to live on the future economy of the surrounding area.
 
The paper <a href="http://www.slideshare.net/dmytrokaramshuk/geospotting-mining-online-locationbased-services-for-optimal-retail-store-placement">Geo-spotting: mining online location-based services for optimal retail store placement</a> provides location recommendations for commercial site selection by analyzing the check-in data of a large number of users, for example, where the ideal location for a new McDonald's restaurant would be. Other work combines road structure, the distribution of points of interest, population mobility, and many other factors to rank the value of housing: when the market is rising, which neighborhoods will see prices rise more, and when the market is falling, which neighborhoods will hold their value better. Rather than using traditional economic models, both of these examples employ machine learning algorithms and a data-driven approach.
 
A city always has unexpected events, such as natural disasters (earthquakes, floods, and so on), large-scale events and commercial promotions, traffic accidents and temporary traffic restrictions, and mass incidents. If such events can be sensed in time, or even predicted, urban management will improve greatly: the government will be better able to respond to emergencies, urban safety will be better protected, and losses and tragedies will be reduced.
 
Some work tries to use specific traffic routes to further explain the cause of an anomaly. As shown in Figure 9, a traffic flow anomaly occurs between the two regions connected by L1, but the problem itself may not lie in these two regions. Here the cause is traffic control for a marathon near Tiananmen: traffic that used to follow the purple dotted line had to detour along the green route. So the green route is the cause of this anomaly. Traffic anomalies are captured from changes in the routes drivers choose, and keywords are then extracted from related Weibo posts to explain the cause of the anomaly, such as a wedding fair or a road collapse.
 
<img style="text-align: center; display: block;" title="City calculation and Big Data" src="/admin/blogs/" alt="Urban Computing and Big Data" width="490" height="275">
 
 Figure 9 Analyzing traffic anomalies
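As a simplified stand-in for the anomaly detection step (the work described above relies on changes in drivers' route choices, which is more involved), the sketch below flags links whose current flow deviates strongly from their own history for the same time slot, using a z-score. The link names, the history layout, and the `z_threshold` value are assumptions for the example.

```python
from statistics import mean, stdev

def anomalous_links(history, current, z_threshold=3.0):
    """history: dict link -> list of past flows for the same time slot.
    current: dict link -> flow observed now.

    Returns {link: z_score} for links whose flow is unusually high or low.
    """
    flagged = {}
    for link, past in history.items():
        if len(past) < 2 or link not in current:
            continue  # not enough history to judge
        spread = stdev(past)
        if spread == 0:
            continue  # constant history: z-score undefined
        z = (current[link] - mean(past)) / spread
        if abs(z) >= z_threshold:
            flagged[link] = round(z, 2)
    return flagged

history = {"L1": [100, 104, 96, 100], "L2": [50, 52, 48, 50]}
current = {"L1": 40, "L2": 51}
flagged = anomalous_links(history, current)
print(flagged)
```

A sharp drop on one link paired with a sharp rise on a nearby detour is the kind of pattern the marathon example describes; explaining it still needs the route-level analysis and the social media keywords.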
The paper <a href="http://shiba.iis.u-tokyo.ac.jp/song/wp-content/uploads/papers/SIGKDD13.pdf">Modeling and probabilistic reasoning of population evacuation during large-scale disaster</a> models, predicts, and simulates the movement and evacuation behavior of disaster victims after the Great East Japan Earthquake and the Fukushima nuclear accident by analyzing a one-year database of GPS trajectories from 1.6 million people in Japan. When a similar event occurs in the future, it becomes possible to learn from the experience of previous disasters and prepare in advance, for example by recommending reasonable evacuation routes to people.
 
Main technologies of urban computing
Sensor networks. These interconnect existing specialized sensors (such as temperature sensors, location sensors, traffic loop detectors, and air quality monitors) and carry out data collection.
 
Active participatory sensing. Users work together on a complex task by actively sharing the data they acquire. For example, each user shares the temperature and humidity measured by the sensors on their mobile phone, and together these readings build up fine-grained weather information for the city.
 
Passive crowd sensing. The various information infrastructures in a city (such as cellular mobile communication systems and transit card systems) provide a good sensing platform for urban computing. These infrastructures were not necessarily built for urban computing, but a large amount of data is generated when people use them, and fusing these data reflects the rhythm of the city well. For example, analyzing the subway card-swipe data of a large number of users reveals the patterns of population flow in a city, and analyzing large-scale taxi trajectory data reveals the traffic flow on urban roads. Unlike in active participatory sensing, users in passive crowd sensing do not know how their data will be used, and may not even know that they are generating data.
 
Data Management Technology
Streaming data management. Since large amounts of sensor data arrive as streams, efficient streaming database techniques are the cornerstone of data management in urban computing.
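One basic building block of stream processing is a sliding-window aggregate, which keeps only recent readings instead of storing the whole stream. The sketch below is a generic illustration rather than any particular streaming database's API; the class name and the 60-second window are assumptions, and readings are assumed to arrive in timestamp order.

```python
from collections import deque

class SlidingWindowMean:
    """Running mean over the readings from the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._items = deque()   # (timestamp, value), oldest first
        self._total = 0.0

    def add(self, timestamp, value):
        self._items.append((timestamp, value))
        self._total += value
        # Evict readings that have fallen out of the window.
        while self._items[0][0] <= timestamp - self.window_seconds:
            _, old_value = self._items.popleft()
            self._total -= old_value
        return self._total / len(self._items)

window = SlidingWindowMean(window_seconds=60)
means = [window.add(0, 10), window.add(30, 20), window.add(70, 30)]
print(means)
```

Each reading is added once and evicted once, so the cost per update stays small no matter how long the stream runs.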
 
Trajectory management. Traffic flow, the movement of people, and location-tagged social media can all be represented as trajectory data (that is, sequences of time-stamped points in time order). Trajectory processing techniques are often used in urban computing, such as map matching, trajectory compression, trajectory search, and frequent pattern mining on trajectories.
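As one example of trajectory compression, here is a small sketch of the classic Douglas-Peucker line simplification, chosen for illustration only (the text does not say which compression method is used). It treats points as planar `(x, y)` coordinates and ignores timestamps, which a production system would also have to preserve; `tolerance` is in the same units as the coordinates.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def compress(points, tolerance):
    """Drop points that lie within `tolerance` of the simplified line."""
    if len(points) < 3:
        return list(points)
    index, worst = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], points[0], points[-1])
        if d > worst:
            index, worst = i, d
    if worst <= tolerance:
        return [points[0], points[-1]]
    # Keep the farthest point and simplify both halves recursively.
    left = compress(points[:index + 1], tolerance)
    right = compress(points[index:], tolerance)
    return left[:-1] + right

track = [(0, 0), (1, 0.05), (2, 0), (3, 2), (4, 0)]
print(compress(track, tolerance=0.1))
```

The small wobble at `(1, 0.05)` is discarded while the sharp turn at `(3, 2)` is kept, which is the behavior wanted when storing large volumes of GPS traces.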
 
Graph data management. Human relationships in social networks, population flow between regions, and traffic flow on roads can all be expressed as graph models, so graph data management and pattern discovery techniques are particularly important. Urban computing applications mostly use graph models with spatiotemporal attributes: each node has spatial coordinates, and the attributes of the edges and nodes (and even the graph structure) change over time. The fastest driving route design, the discovery of flawed planning in the road network, the discovery of a city's functional areas, and the detection of traffic flow anomalies mentioned above all use graphs with spatiotemporal attributes as their research model.
 
Spatio-temporal indexing. Effective indexing can greatly improve the efficiency of data retrieval. Since space and time are the two most commonly used data dimensions in urban computing, various spatial and spatiotemporal indexes are commonly used techniques. More importantly, spatiotemporal indexing is used to associate and organize different kinds of data (such as text and traffic flow) in preparation for efficient data mining and analysis later.
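A very simple way to associate different kinds of data through space and time is a uniform grid index keyed by (latitude cell, longitude cell, time slot). The sketch below is a minimal illustration under that assumption; the class name, the 0.01 degree cell size, and the one-hour slot are arbitrary choices, and real systems typically use richer structures such as R-trees or quadtrees.

```python
from collections import defaultdict

class GridTimeIndex:
    """Buckets records of any kind by (latitude cell, longitude cell, time slot)."""

    def __init__(self, cell_size_deg=0.01, slot_seconds=3600):
        self.cell_size_deg = cell_size_deg
        self.slot_seconds = slot_seconds
        self._buckets = defaultdict(lambda: defaultdict(list))

    def _key(self, lat, lon, timestamp):
        return (
            int(lat // self.cell_size_deg),
            int(lon // self.cell_size_deg),
            int(timestamp // self.slot_seconds),
        )

    def add(self, kind, lat, lon, timestamp, payload):
        self._buckets[self._key(lat, lon, timestamp)][kind].append(payload)

    def lookup(self, lat, lon, timestamp):
        """Return {kind: [payloads]} for the cell and slot containing the query."""
        bucket = self._buckets.get(self._key(lat, lon, timestamp), {})
        return {kind: list(items) for kind, items in bucket.items()}

index = GridTimeIndex()
index.add("traffic", 39.9051, 116.3912, 1000, 42)
index.add("text", 39.9058, 116.3917, 2000, "jam near the square")
index.add("traffic", 39.9500, 116.3912, 1000, 7)    # different cell
index.add("traffic", 39.9051, 116.3912, 4000, 55)   # different time slot
nearby = index.lookup(39.9055, 116.3915, 1500)
print(nearby)
```

Because traffic readings and text posts that fall in the same cell and slot land in the same bucket, later mining steps can join them with a single lookup.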
 
Data Mining Technology 
There are many data mining and machine learning algorithms for urban computing. Various pattern discovery, statistical learning, and artificial intelligence methods can be applied to this field. However, there are two factors to consider when selecting these technologies:
 
Learning mutually reinforcing knowledge from heterogeneous data. There are generally three ways to achieve this goal. (1) Extract features from the different data, then simply concatenate and normalize them into one feature vector that is fed into a machine learning model. This approach is not the most effective, because it does not distinguish between the properties of the different data. (2) Use different data in turn at different stages of the computational model. For example, the paper <a href="http://research.microsoft.com/apps/pubs/default.aspx?id=152137">Urban computing with taxicabs</a> first divides the city into many regions using road data, then maps trajectory data onto these regions to build a graph, and finally finds flawed road planning by analyzing the graph model. (3) Feed different data into different parts of the same computational model. For example, the paper Discovering regions of different functions in a city using human mobility and POIs feeds human mobility data and POI data into two different parts of a topic model to analyze the functional regions of a city. In air quality inference, time-varying information such as traffic flow, human mobility, and meteorological data is fed into a conditional random field (CRF) to model the temporal correlation of air quality at one location, while spatial (time-invariant) information such as road structure and the distribution of points of interest is fed into a neural network to model the correlation of air quality between different regions. The two models then iterate and reinforce each other in a semi-supervised learning framework to jointly infer the air quality of a location. If all the data were simply fed into a single classifier, the spatial data, which do not change over time, would be ignored, and the prediction would suffer.
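For reference, approach (1), the simple baseline that the text argues is not the most effective, can be sketched as below: each data source is min-max normalized column by column and the results are concatenated into one feature vector per region. The feature values are invented, and the helper names are assumptions for the example.

```python
def min_max_normalize(rows):
    """Scale every column of a list of feature rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def concatenate_features(*sources):
    """Normalize each source separately, then join the rows sample by sample."""
    normalized = [min_max_normalize(source) for source in sources]
    return [sum(parts, []) for parts in zip(*normalized)]

traffic = [[100], [300], [200]]          # one time-varying feature per region
pois = [[5, 1], [15, 3], [10, 2]]        # two static features per region
features = concatenate_features(traffic, pois)
print(features)
```

After concatenation the model can no longer tell which columns are time-varying and which are static, which is the weakness that approaches (2) and (3) are meant to address.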
 
Coping with data sparsity. Big data and data sparsity are not contradictory. Take fine-grained air quality prediction for a city as an example: the traffic flow, human mobility, road, and point-of-interest data we can observe are all big data, but because only a limited number of monitoring stations produce air quality readings, the training data is sparse. Another example is using taxis to estimate a city's fuel consumption. The volume of taxi GPS trajectory data is huge, yet at any given moment a considerable number of gas stations have no taxis visiting them, and estimating fuel consumption at these stations is also a data sparsity problem. There are usually three ways to address this. (1) Use semi-supervised learning or transfer learning. For example, a semi-supervised learning algorithm is used in air quality inference to make up for the shortage of training samples caused by the small number of air monitoring stations. (2) Use matrix factorization and collaborative filtering. Urban fuel consumption estimation uses this method to deal with data sparsity. (3) Use similarity-based clustering. Suppose we need to estimate the number of vehicles on the roads from loop detectors buried in the road surface, but since not all roads have detectors, the flow on many roads cannot be measured directly. Based on the topological structure of the roads and the distribution of points of interest around them, we can compute the similarity between roads and cluster them. Roads that fall into the same cluster are likely to have the same traffic pattern, so within a cluster we can assign the readings of roads with sensors to the roads without sensors.
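The last step of approach (3) can be sketched in a few lines. The example assumes that cluster labels have already been produced by some clustering algorithm over road topology and surrounding points of interest (that step is not shown), and it fills each road without a sensor with the mean reading of the sensored roads in its cluster. Road names, labels, and flows are made up.

```python
from collections import defaultdict
from statistics import mean

def fill_missing_flows(cluster_of, readings):
    """cluster_of: dict road -> cluster label (assumed precomputed).
    readings: dict road -> measured flow, only for roads with sensors.

    Returns a copy of `readings` extended with cluster-mean estimates.
    """
    by_cluster = defaultdict(list)
    for road, flow in readings.items():
        by_cluster[cluster_of[road]].append(flow)

    filled = dict(readings)
    for road, cluster in cluster_of.items():
        if road not in filled and by_cluster[cluster]:
            filled[road] = mean(by_cluster[cluster])
    return filled

cluster_of = {"r1": 0, "r2": 0, "r3": 0, "r4": 1, "r5": 1, "r6": 2}
readings = {"r1": 100, "r2": 120, "r4": 40}
filled = fill_missing_flows(cluster_of, readings)
print(filled)
```

Road `r6` stays without an estimate because no road in its cluster has a sensor; handling such cases is where the other two approaches come in.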
 
Various optimization techniques are also frequently used in urban computing. For example, the paper <a href="http://research.microsoft.com/apps/pubs/default.aspx?id=174865">T-Share: a large scale dynamic taxi ridesharing service</a> finds the best taxi to pick up and drop off passengers by combining spatio-temporal search with route optimization. The paper <a href="http://research.microsoft.com/apps/pubs/default.aspx?id=174345">Inferring the root cause in road traffic anomalies</a> uses linear programming to identify the traffic flows most likely to have caused a traffic anomaly. The paper <a href="http://research.microsoft.com/apps/pubs/default.aspx?id=151647">Where to Find My Next Passenger?</a> recommends to taxi drivers the best routes for finding passengers.
 
Visualization techniques for mixed data
Visualization helps us understand acquired knowledge and patterns in an intuitive way. Figure 10 is a heat map of the number of people arriving in each area by taxi between 12:00 and 14:00 on weekdays (the darker the color, the more people). Playing such heat maps for successive time periods shows the population flow patterns of the entire city dynamically. Relatively speaking, the central business district in eastern Beijing attracts more people. Unlike the visualization of a single kind of data, visualization in urban computing needs to consider multiple dimensions simultaneously, of which space and time are the two most crucial.

Conclusion
Urban computing is an emerging interdisciplinary field, the intersection of computer science with traditional urban planning, transportation, energy, economics, environment, and sociology in the urban space. It bears on the future quality of life and sustainable development of humankind. The arrival of the era of big data gives urban computing more opportunities and broader prospects.
 
Statement: "Urban Computing and Big Data" is reprinted on the Microsoft Research blog with the consent of the China Computer Federation Newsletter, which holds the copyright. The original text was published in the China Computer Federation Newsletter, August 2013.
