# McRAPTOR based delay-predicting robust journey planner

A. Coudray, F. Grimberg, C. Pulver, T. Turner
This repository contains the final project of group lgptguys for the spring 2020 _Lab in Data Science_ course at EPFL. The original project assignment can be found in [Project_Description.md](Project_Description.md).
Our work consists of a robust journey planner, which finds journeys between stops within a 15 km radius around Zurich Hbf and computes their probability of success based on the distribution of delays over past years. Given a source stop, a target stop, a latest acceptable arrival time, and a lowest acceptable probability of success for the entire journey, our journey planner finds the journey with the latest departure time among all acceptable options. It is based on the multiple-criteria (Mc) extension of the RAPTOR algorithm (Round-bAsed Public Transit Optimized Router) described in Delling, D., Pajor, T., & Werneck, R. F. (2015). Round-based public transit routing. _Transportation Science, 49_(3), 591-604.
In the first section below, we describe how our implementation can be used, as well as the steps necessary to fully reproduce it. In the second section, we outline the main arguments for basing our journey planner on McRAPTOR. In each subsequent section, we describe the design, implementation, validation, and testing of one part of the project.
## Using the files in this repo

### Finding a journey from A to B

$\rightarrow$ Instructions for finding a route:
- Download the files stored on git-LFS: `git lfs pull`
- Open `MC_RAPTOR.ipynb`
- ...
### Reproducing our work

$\rightarrow$ Which files to run, in which order, with references to the corresponding sections of this README.
## Choosing a route planning algorithm

$\rightarrow$ A quick recap of the advantages of RAPTOR that led us to choose it, with a reference to the paper:
- Takes advantage of the nature of public transit networks, where trips occur only along certain routes and at certain times of day $\Rightarrow$ efficient
- The paper proposes a generic multiple-criteria version, which can be adapted to include virtually any optimization criterion.
- ...
## Modeling the public transport infrastructure

$\rightarrow$ How did we generate the data structures required by the RAPTOR algorithm from the available data?

Actually, should we include a summary of the data structures, or just reference the appendix of the paper?

$\rightarrow$ How did we validate / test this part of the project?
## Building a predictive model for the success of a journey

### Matching SBB and Timetable datasets

_Note: this section summarizes what is done in `hdfs_match_datasets.ipynb`._

As we use the _timetable_ dataset to build routes and trips, but the _sbb_ dataset to compute delays, we need a robust translation between the two. Specifically, we need to translate `trip_id` and `stop_id` so that we can compute delay distributions for every `trip_id` x `stop_id` pair.
#### Get corresponding stop_id between two datasets

We first look at the station names in the _timetable_ dataset. A `stop_id` can be given in several formats, shown below:

- `8502186` : the format identifying the stop itself, which matches the `bpuic` field in the _sbb_ dataset.

We will refer to the next three formats as special cases throughout the notebook:

- `8502186:0:1` or `8502186:0:2` : the individual platforms, separated by ":". A "platform" can also be a platform plus sectors (e.g. `8500010:0:7CD`).
- `8502186P` : all the platforms of a stop share a common "parent" (e.g. `8500010P`).
- `8502186:0:Bfpl` : used by the RBS for rail replacement buses.

Source: timetable cookbook, section `stops.txt`.
In the _sbb_ actual data, the `bpuic` field is the equivalent of a `stop_id` in its first format, i.e. it identifies the station without platform information. To get from the _timetable_ format to an _sbb_-compatible one, we therefore keep only the first 7 characters of the _timetable_ `stop_id`.
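A minimal sketch of this normalization (the function name is ours, for illustration):

```python
# Minimal sketch: normalize a timetable stop_id to the sbb bpuic format.
# All special-case formats share the plain stop id as their first 7 characters.
def to_bpuic(stop_id: str) -> str:
    # "8502186:0:1", "8502186P", "8502186:0:Bfpl" all map to "8502186"
    return stop_id[:7]

assert to_bpuic("8500010:0:7CD") == "8500010"
```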
#### Get corresponding trip_id between two datasets

In the _sbb_ dataset, trip ids are given by the `FAHRT_BEZEICHNER` field; in _timetable_, by `trip_id`. To match the two datasets, we join on `stop_id`, `departure_time`, and `arrival_time`, so that corresponding trip ids end up on the same row. The idea is to keep every `trip_id` with more than X matches between the two datasets; we chose 2 as the minimum number of matches. _Note: with a threshold > 2, we were not able to match InterCity / InterRegio trains, which have very few stops within the 15 km perimeter._
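A hedged sketch of this matching step (the dataframe and column names here are illustrative, not necessarily those used in the notebook):

```python
from pyspark.sql import functions as F

# stop_times: timetable data with (trip_id, stop_id, arrival_time, departure_time)
# sbb: sbb data with (FAHRT_BEZEICHNER, bpuic, arrival_time, departure_time)
candidates = (
    stop_times
    .withColumn("bpuic", F.substring("stop_id", 1, 7))  # sbb-compatible stop id
    .join(sbb, on=["bpuic", "arrival_time", "departure_time"])
)

# Keep trip_id pairs with at least 2 stops matching on times.
trip_translation = (
    candidates
    .groupBy("trip_id", "FAHRT_BEZEICHNER")
    .count()
    .filter(F.col("count") >= 2)
)
```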
Depending on the number of matches found, each _timetable_ trip falls into one of three cases, which determine how its delay distribution is obtained:

- One-to-one (we find a single clear match): we use the distribution of delays on weekdays for the given `trip_id` / `stop_id`, based on all past _sbb_ data.
- One-to-many (we find multiple matches): the matches are aggregated together in the final distribution table.
- One-to-none (we find no `trip_id` match between the datasets): as described later, we use the delay distribution of similar trips (sharing `stop_id`, transport type, and hour) to infer the delay.
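For illustration, the labeling could be expressed as follows (a sketch; the notebook's actual implementation may differ):

```python
def match_label(n_sbb_matches: int) -> str:
    """Classify a timetable trip_id by its number of matched sbb trip ids."""
    if n_sbb_matches == 0:
        return "one-to-none"
    if n_sbb_matches == 1:
        return "one-to-one"
    return "one-to-many"
```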
### Get Distributions of Delay Times per trip and station

_Note: this summarizes `hdfs_get_distributions.ipynb`._

The goal of this chapter is to create a distribution of arrival delays for each station / `trip_id` pair, to be used later on to compute transfer probabilities. These are then used in the McRAPTOR implementation to choose the best journeys according to both their time and their probability of success.
#### Work from translation tables

We use the data generated in `hdfs_match_datasets.ipynb`, which matches `trip_id` between the _timetable_ and _sbb_ datasets. We begin by looking at all `trip_id` found in both datasets with at least 5 stations in common.

Our goal is to find a match in the _sbb_ dataset for every _timetable_ trip (and not the other way around), so we focus on building this asymmetrical correspondence table.
To do that, we need multiple joins, as we want to combine 3 tables: the _sbb_ data, which contains the delay information; the `joined_trip_atL5_3` table, which contains the translation between the two datasets' trip ids; and `stop_time`, which contains all the unique `stop_id` x `trip_id` pairs used in later steps.

- First, we join the _sbb_ data `sbb_filt_forDelays_GeschaetzAndReal_2` with the translation table `joined_trip_atL5_3`, to obtain _sbb_ data annotated with the _timetable_ `trip_id`.
- We then use this _timetable_ `trip_id` to join the resulting table with the `stop_time` table, using a _left outer_ join, so that we can see how many matches are found overall.
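A sketch of these two joins (dataframe and column names are illustrative):

```python
from pyspark.sql import functions as F

# sbb_delays: sbb delay data, keyed by FAHRT_BEZEICHNER and bpuic
# trip_translation: (FAHRT_BEZEICHNER, trip_id) pairs from the matching step above
sbb_with_tt = sbb_delays.join(trip_translation, on="FAHRT_BEZEICHNER")

# stop_time holds the unique stop_id x trip_id pairs; the left outer join keeps
# all of them, so pairs without sbb data appear as nulls and can be counted.
stops = stop_time.withColumn("bpuic", F.substring("stop_id", 1, 7))
full = stops.join(sbb_with_tt, on=["trip_id", "bpuic"], how="left_outer")
n_unmatched = full.filter(F.col("FAHRT_BEZEICHNER").isNull()).count()
```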
First, we load the SBB data. The following cells were run twice: once for _geschaetz_ / _real_ delays only, and once for all delays.

- geschaetz / real : load and use the `/user/{}/sbb_filt_forDelays_GeschaetzAndReal_2.orc` table
- all : load and use the `/user/{}/sbb_filt_forDelays_AllDelays.orc` table
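A sketch of the loading step (assuming the `{}` placeholder stands for the HDFS username):

```python
import getpass

username = getpass.getuser()  # assumption: tables live under the user's HDFS home
sbb_real = spark.read.orc(f"/user/{username}/sbb_filt_forDelays_GeschaetzAndReal_2.orc")
sbb_all = spark.read.orc(f"/user/{username}/sbb_filt_forDelays_AllDelays.orc")
```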
### Compute probability of transfer success from delay distributions

_Note: this summarizes `proba_functions.ipynb`, which is run locally._

To compute the probability of success of a given transfer, we compare the arrival delay distribution with the departure time of the next trip. For that, we need a delay distribution for each trip's arrival at a given station. We then use a cumulative distribution function to compute $P(T \leq t)$:

$$F_{T}(t)=\operatorname{P}(T\leq t)=\sum_{t_{i}\leq t}\operatorname{P}(T=t_{i})=\sum_{t_{i}\leq t}p(t_{i}).$$
The strategy is to rely entirely on past data to compute $p(t_i)$, rather than building a model, which would imply making additional assumptions. If we have enough data for a given transfer with known `trip_id` x `stop_id`, we use the above formula, computing each $p(t_i)$ simply as:

$$p(t_i) = \frac{x_i}{\sum_j x_j}$$

where $x_i$ is the number of observed delays of duration $t_i$ in the SBB dataset.
We make a few assumptions (see the sketch after this list):

- We assume that if we have less than 2 minutes for the transfer, we miss it.
- We assume the next train departs on time.
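Putting the formula and the assumptions together, a transfer success probability can be computed from raw delay counts roughly as follows (a minimal sketch; function and variable names are ours):

```python
def transfer_success_probability(delay_counts, slack_seconds, margin=120):
    """P(making the transfer) = P(arrival delay T <= slack - 2 min).

    delay_counts: dict mapping arrival delay t_i (seconds) -> observed count x_i
    slack_seconds: scheduled time between arrival and the next departure,
                   assuming the next train departs on time
    """
    budget = slack_seconds - margin  # 2-minute minimum transfer time
    if budget < 0:
        return 0.0
    total = sum(delay_counts.values())  # sum_j x_j
    # Empirical CDF: F_T(budget) = sum over t_i <= budget of x_i / sum_j x_j
    return sum(x for t, x in delay_counts.items() if t <= budget) / total
```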
#### Recover missing data

Whenever we cannot find a clear match for a given `trip_id` x `stop_id`, we use aggregated delay distributions from similar transfers, to which we apply the same CDF as above.
To recover missing or faulty data, the strategy is the following (sketched in code after this list):

- If we have more than 100 data points in the _real_ group, we rely exclusively on its delay distribution to compute the probabilities for a given transfer on a `trip_id` x `stop_id`.
  _Note: the real group corresponds to arrival times with status geschaetz or real, meaning they come from actual measurements._
- If we do not find enough data within the _real_ group, we use the delay distributions in the _all_ group (which contains all delays, including those with prognose status), provided there are more than 100 data points for the given `trip_id` x `stop_id`.
- If the _all_ group still does not have more than 100 data points, we rely on recovery tables to estimate delay distributions:
  - As we always know the `stop_id`, the time, and the `transport_type`, we rely on arrival delays aggregated over similar transfers.
  - First, we compute a distribution table with all possible combinations of `stop_id`, time (rounded to hours), and `transport_type`, and aggregate all the counts we have to compute cumulative distribution probabilities.
  - If there are fewer than 100 data points in one of these intersections, we use the last possibility: a table with counts aggregated by `transport_type` x time.
  - The remaining values with no match are given the overall average of the cumulative distribution probabilities for each `transport_type`, with no minimum number of data points.
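A sketch of this fallback cascade (all table and variable names here are hypothetical):

```python
MIN_POINTS = 100

def pick_distribution(trip_id, stop_id, hour, transport_type):
    """Return the most specific delay distribution with enough data points."""
    for key, table in [
        ((trip_id, stop_id), real_dist),             # geschaetz / real only
        ((trip_id, stop_id), all_dist),              # includes prognose
        ((stop_id, hour, transport_type), by_stop_hour_type),
        ((hour, transport_type), by_hour_type),      # transport_type x time
    ]:
        dist = table.get(key)
        if dist is not None and dist["count"] >= MIN_POINTS:
            return dist
    return avg_by_type[transport_type]  # overall average, no minimum
```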
Following this approach, we can find cumulative distribution probabilities for every combination of `trip_id` x `stop_id` defined in `stop_times_df`. We build a table with the same row order, so that McRAPTOR can easily find the corresponding indexes.
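For illustration, building that table could look like this (assuming pandas, the hypothetical `pick_distribution` above, and `hour` / `transport_type` columns in `stop_times_df`):

```python
import pandas as pd

def build_cdf_table(stop_times_df: pd.DataFrame) -> list:
    # One distribution per (trip_id, stop_id), in the same row order as
    # stop_times_df, so that row i of stop_times_df maps to cdf_table[i].
    return [
        pick_distribution(row.trip_id, row.stop_id, row.hour, row.transport_type)
        for row in stop_times_df.itertuples(index=False)
    ]
```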
#### Evaluate / Validate recovery tables

The question is: how precise are these recovery-table-derived probabilities? Which ones should be used in priority?
To add ...
## Implementing a route planning algorithm

$\rightarrow$ How we adapted McRAPTOR to our case.

$\rightarrow$ How did we validate / test this part of the project?