# McRAPTOR based delay-predicting robust journey planner A. Coudray, F. Grimberg, C. Pulver, T. Turner This repository represents the final project of group **lgptguys** for the spring 2020 _Lab in Data Science_ course at EPFL. Find the original assignment for the project in [Project_Description.md](Project_Description.md) Our work consists of a robust journey planner, which finds journeys between stops within a 15km radius around Zurich Hbf and computes their probability of success based on the distribution of delays over the past years. Given a source stop, a target stop, a latest acceptable arrival time, and a lowest acceptable probability of success for the entire journey, our journey planner finds the journey whose departure time is latest among all acceptable options. Our journey planner is based on the multiple-criteria (Mc) extension of the RAPTOR algorithm (Round bAsed Public Transit Optimized Router) detailed in [Delling, D., Pajor, T., & Werneck, R. F. (2015). Round-based public transit routing. *Transportation Science, 49*(3), 591-604.](https://www.microsoft.com/en-us/research/wp-content/uploads/2012/01/raptor_alenex.pdf). ### Content 1. [Using the files in this repo](#1.-Using-the-files-in-this-repo) 2. [Choosing a route planning algorithm](#2.-Choosing-a-route-planning-algorithm) 3. [Modeling the public transit infrastructure](#3.-Modeling-the-public-transit-infrastructure) 4. [Building a predictive model for the success of a journey](#4.-Building-a-predictive-model-for-the-success-of-a-journey) 5. [Implementing a route planning algorithm](#5.-Implementing-a-route-planning-algorithm) ## 1. Using the files in this repo ### Finding a journey from A to B To use our robust journey planner, simply open [our McRAPTOR notebook](notebooks/MC_RAPTOR.ipynb) and follow the instructions outlined at the start of the notebook. ### Reproducing our work To reproduce all of our work, starting from the available SBB data, please follow these steps: - Transform the GTFS files for the Swiss transport network to GTFS-like files `stop_times` and `transfers` by running the [data wrangling before RAPTOR notebook](notebooks/data_wrangling_before_RAPTOR.ipynb) in its entirety. - Pull the `.csv` HDFS files from the distributed file system to the local Renku environment by executing the [transfer to local notebook](notebooks/transfer_to_local.ipynb) - Transform the `.csv` GTFS-like tables to python objects usable within our implementation of RAPTOR using the [generating arrays for RAPTOR notebook](notebooks/generating_arrays_for_RAPTOR.ipynb) - Generate `match_datasets_translation.csv` HDFS file from [hdfs_match_datasets.ipynb](notebooks/hdfs_match_datasets.ipynb) - Generate dictionnaries of delay distribution in local `data/d_all.pck.gz` and `data/d_real.pck.gz` from [hdfs_get_distributions.ipynb](notebooks/hdfs_get_distributions.ipynb) - Generate pre-computed probability of success from distribution - Then, open [the McRAPTOR notebook](notebooks/MC_RAPTOR.ipynb) and follow steps 2-4 at the start of the notebook. ## 2. Choosing a route planning algorithm We investigated several options, such as implementing a custom Breadth-First Search (BFS) or Dijkstra algorithm on a time-expanded graph. However, after a brief literature search, we decided to adapt and implement the multiple criteria variant (McRAPTOR) of the state-of-the-art algorithm described in [Delling et al.](https://www.microsoft.com/en-us/research/wp-content/uploads/2012/01/raptor_alenex.pdf) (cf. complete citation above). The main arguments in favor of this algorithm are as follows: - RAPTOR (and its variants) does not require transforming the public transit schedule into a (time-extended or time-expanded) graph. Instead, the algorithm directly takes advantage of the nature of public transit networks, where trips occur only along certain routes and at certain times of day. This is much more efficient, as it avoids the crass increase in graph size linked to time-expansion. Our implementation of McRAPTOR runs on a single core and a very modest amount of RAM (less than 1G). The RAPTOR approach also more closely mimicks actual passenger behaviour, which tends to fully explore the currently available routes before exploring possible transfers between routes. - The paper proposes a generic multiple-criteria variant based on notions of [Pareto efficiency](https://en.wikipedia.org/wiki/Pareto_efficiency), primarily Pareto domination and Pareto sets. This variant can be adapted to include any number of arbitrary optimization criteria at the cost of a slightly less efficient implementation. However, most of the increase in execution times from RAPTOR to McRAPTOR can probably be attributed to the inescapable fact that multiple-criteria optimization is a harder problem than single-criteria optimization, simply because it admits more solutions at each step along the way. ## 3. Modeling the public transit infrastructure RAPTOR models the public transit infrastructure similarly to the GTFS standard. The "atoms" of this representation are the stop times (stored in the GTFS-like array StopTimes): an arrival and a departure time record for each and every time a vehicle stops to pick up passengers somewhere in the transport network. The key idea behind RAPTOR is to organise these arrival and departure times in a consistant way such that the network is explored efficiently, that is by route and trip. A route is very similar to our intuitive understanding of a transport line: it consists of a sequence of stops always visited in the same order. A trip (always assigned to a route) adds the exact stop times for one journey traversing the route. The key idea behind the data structure used by RAPTOR is to create a collection of arrays storing pointers corresponding to routes and stops to explore and navigate the StopTimes array efficiently. In particular, the speed of RAPTOR makes use of in-memory data locality for the arrays. For this reason, we work exclusively with `Numpy` arrays. ### The SBB and timetable datasets ### The data structures of the RAPTOR model Actually, should we include a summary of the data structures or just reference the appendix of the paper? ### Main contributions (rename this) $\rightarrow$ How did we generate the data structures required by the RAPTOR algorithm from the available data? ### Validation / testing / evaluation $\rightarrow$ How did we validate / test this part of the project? - Sanity checks in your notebooks - As a further means of testing, we used the produced datasets to find journeys between random stops with our robust journey planner. ## 4. Building a predictive model for the success of a journey ### Matching SBB and Timetable datasets _Note : This section summarize what is done in `hdfs_match_datasets.ipynb`_ As we use _timetable_ dataset to build routes and trips, but sbb to compute delays, we need a robust translation between both datasets. What we needed was to translate trip_id and stop_id to be able to compute delay distributions for every pair of `trip_id` x `stop_id`. #### Get corresponding stop_id between two datasets We first look at the station names in _timetable_ dataset. Stop_id can be given in multiple formats shown below : - `8502186` : the format defining the stop itself, which matches `bpuic` field in _sbb_ dataset We will call the 3 next ones __Special cases__ throughout the notebook : - `8502186:0:1` or `8502186:0:2` : The individual platforms are separated by “:”. A “platform” can also be a platform+sectors (e.g. “8500010:0:7CD”). - `8502186P` : All the stops have a common “parent” “8500010P”. - `8502186:0:Bfpl` : if the RBS uses it for rail replacement buses. source : [timetable cookbook](https://opentransportdata.swiss/en/cookbook/gtfs/), section stops.txt In the sbb actual_data we find equivalent to stop_id in its first format defining the station without platform information, in its `bpuic` field. To get from _timetable_ to a _sbb_-compatible format, we used only __first 7 characters__ of _timetable_ stop_id. #### Get corresponding trip_id between two datasets In sbb dataset, the trip ids are defined by `FAHRT_BEZEICHNER` field and in timetable by `trip_id`. To match both datasets, we matched `stop_id` , `departure_time` and `arrival_time` (with a join) to get corresponding trip_id on the same line. The idea is to take every trip_id with more than X matches between the two datasets. We decided to use 2 as a minimum number of matches needed. _Note : with a threshold > 2 we were not able to get InterCity / InterRegio trains, which have very few stops in the 15km perimeter._ These labels will be used to differentiate 3 different ways to delay distributions : - __One-to-one__ we find a clear match : we use distribution of delays on weekdays for a given trip/station_id based on all past sbb data. - __One-to-many__ we find multiple match : - Matches are aggregated together in the final distribution table - __One-to-none__ we find no match for trip_id between datasets : as described later, we will use delay distribution of similar trip (sharing stop_id, transport type and hour) to infer the delay. ### Get Distributions of Delay Times per trip and station _Note : This summarize `hdfs_get_distributions.ipynb`_ The goal of this chapter is to create a distribution of arrival delays for each station / trip_id pair, to be used later on to compute transfer probabilities. These are then used in McRaptor implementation, to choose the best trip according to their time but also their __probability of success__. #### Work from translation tables We used data generated in `hdfs_match_datasets.ipynb`, that matches trip_id between _timetable_ and _sbb_ dataset. We begin by looking at all trip_id that are found in both dataset with at least 5 stations in common. Our goal is to find a match in sbb dataset for all _timetable_ trips (and not the other way around). So we will focus on getting this assymetrical correspondance table. In order to do that, we need to do multiple join, as we want to join 3 tables : _sbb_ data which contains information about delays, `joined_trip_atL5_3` table which contains translation between trip_id in two datasets, and `stop_time` which contains all the unique stop_id x trip_id used for later steps. - First, we join _sbb_ data `sbb_filt_forDelays_GeschaetzAndReal_2` with translation table `joined_trip_atL5_3` to get sbb data with information about _timetable_ trip_id. - We can then use this _timetable_ trip_id to join this first table with `stop_time` table, using a _left_outer_ join, so that we get an idea of how many matches are found overall. First we load SBB data. Following cells were ran twice : once for `geschaetz` / `real` delays only, and once for `all` delays. - `geschaetz` / `real` : load and use `/user/{}/sbb_filt_forDelays_GeschaetzAndReal_2.orc` table - `all` : load and use `/user/{}/sbb_filt_forDelays_AllDelays.orc` table ### Compute probability of transfer success from delays distributions _Note : This summarize `proba_functions.ipynb` and is run in local._ To be able to compute the probability of success of a given transfert, we use the arrival delay distribution compared with the next trip departure. To be able to do that, we need delay distributions for each trip arrival to a given station. We then use a __cumulative distribution function__ to compute $P(X \leq x)$ : $${\displaystyle F_{X}(x)=\operatorname {P} (T\leq t)=\sum _{t_{i}\leq t}\operatorname {P} (T=t_{i})=\sum _{t_{i}\leq t}p(t_{i}).}$$ The strategy was to rely entirely on past data to compute $p(t_i)$, without the need of building a model which imply making additionnal assumptions. If we have enough data for a given transfer with known trip_id x stop_id, we use the the abovementionned formula to compute each $p(t_i)$ by simply using : $p(t_i) = \frac{x_i}{\sum x_i}$ with $x_i$ being the number of delays at time $t_i$ from SBB dataset. We make a few __assumptions__ : - We assume that if we have less than 2 minutes for the transfer, we miss it. - We assume the next train is on time. #### Recover missing data Whenever we cannot find a clear match for a given `trip_id` x `stop_id`, we use aggregated delay distributions from similar transfer, on which we used the same CDF function abovementionned. To recover missing or faulty data, the strategy is the following : 1. If we have more than 100 data points in `real` group, we rely exclusively on its delay distribution to compute probabilities for a given transfer on a `trip_id x stop_id`. _Note : `real` group corresponds to arrival time with status `geschaetz` or `real`, meaning it comes from actual measurments._ 2. If we do not find enough data within `real` group, we use delay distributions in `all` group (contains all delays including `prognose` status), if there is more than 100 data points for a given `trip_id x stop_id`. 3. If `all` group still does not have more than 100 data points, we rely on `recovery tables` to estimate delay distributions. The strategy is the following : - As we will always know the `stop_id`, the `time` and the `transport_type`, we rely on arrival delays from aggregated values of similar transfer. - First, we compute a table of distribution with all possible combination of `stop_id`, `time` (round to hours) and `transport_type`, and aggregate all the counts we have to compute cumulative distribution probabilities. - Is there is less than 100 data points in one of these intersections, we use the last possibilities : a table with `transport_type` x `time` aggregate counts. - The last values with no match are given the overall average of cumulative distribution probabilities for each `transport_type` with no limit for the minimum number of data points. Following this approach, we can find cumulative distribution probabilities for every combination of `trip_id x stop_id` as defined in `stop_times_df`. We will make a table with the same row order so that McRaptor can easily find their indexes. #### Evaluate / Validate recovery tables The question is : How precise are these recovery table-derived probabilities ? Which one should be used in priority ? To add ... ## 5. Implementing a route planning algorithm $\rightarrow$ comment on a adapté MC RAPTOR pour notre cas - Mc, bags, probas - route can visit same stop multiple times - foot-paths once at start, as well $\rightarrow$ how did we validate / test this part of the project