Find the original assignment for the project in [Project_Description.md](Project_Description.md)

### Matching SBB and Timetable datasets

_Note: this section summarizes what is done in `hdfs_match_datasets.ipynb`._

Since we use the _timetable_ dataset to build routes and trips but the _sbb_ dataset to compute delays, we need a robust translation between the two datasets. Concretely, we need to translate `trip_id` and `stop_id` so that we can compute delay distributions for every `trip_id` x `stop_id` pair.

#### Get corresponding stop_id between two datasets

We first look at the station identifiers in the _timetable_ dataset. A stop_id can come in several formats, shown below:
- `8502186`: the base format identifying the stop itself, which matches the `bpuic` field in the _sbb_ dataset.

We will call the next three formats __special cases__ throughout the notebook:
- `8502186:0:1` or `8502186:0:2`: the individual platforms are separated by ":". A "platform" can also be a platform plus sectors (e.g. `8500010:0:7CD`).
- `8502186P`: all the stops share a common "parent" (e.g. `8500010P`).
- `8502186:0:Bfpl`: used by the RBS for rail replacement buses.

Source: [timetable cookbook](https://opentransportdata.swiss/en/cookbook/gtfs/), section stops.txt.

In the _sbb_ actual data, the equivalent of stop_id appears in the `bpuic` field, in the first format, i.e. the station without platform information. To get from the _timetable_ format to an _sbb_-compatible one, we therefore keep only the __first 7 characters__ of the _timetable_ stop_id.

#### Get corresponding trip_id between two datasets

In the _sbb_ dataset, trip ids are given by the `FAHRT_BEZEICHNER` field; in _timetable_, by `trip_id`. To match the two datasets, we join on `stop_id`, `departure_time` and `arrival_time`, which puts the corresponding trip ids on the same row. The idea is to keep every trip_id with at least X matches between the two datasets; we chose 2 as the minimum number of matches (see the sketch after the list below). _Note: with a threshold above 2 we were not able to get InterCity / InterRegio trains, which have very few stops inside the 15 km perimeter._

Depending on the number of matches found, we label each trip and use these labels to distinguish three different ways of building delay distributions:
- __One-to-one__, we find a single clear match: we use the distribution of delays on weekdays for the given trip/station_id, based on all past _sbb_ data.
- __One-to-many__, we find multiple matches: the matches are aggregated together in the final distribution table.
- __One-to-none__, we find no match for the trip_id between datasets: as described later, we use the delay distribution of similar trips (sharing stop_id, transport type and hour) to infer the delay.
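To make the two matching steps above concrete, here is a minimal PySpark sketch. The DataFrame and column names (`timetable_st` for the _timetable_ stop times, `sbb` for the _sbb_ actual data) are hypothetical placeholders, not the notebook's actual variable names.

```python
from pyspark.sql import functions as F

# Step 1: make the timetable stop_id sbb-compatible by keeping only the
# first 7 characters, which drops platform / parent / Bfpl suffixes.
timetable_st = timetable_st.withColumn("stop_id7", F.substring("stop_id", 1, 7))

# Step 2: join on stop and times so corresponding trip ids land on the
# same row, then keep trip_id pairs with at least 2 matches.
trip_matches = (
    timetable_st.join(
        sbb,
        (timetable_st["stop_id7"] == sbb["bpuic"])
        & (timetable_st["arrival_time"] == sbb["arrival_time"])
        & (timetable_st["departure_time"] == sbb["departure_time"]),
    )
    .groupBy("trip_id", "FAHRT_BEZEICHNER")
    .count()
    .filter(F.col("count") >= 2)  # threshold of 2 keeps IC/IR trains
)
```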
### Get Distributions of Delay Times per trip and station

_Note: this section summarizes `hdfs_get_distributions.ipynb`._

The goal of this chapter is to build a distribution of arrival delays for each station / trip_id pair, to be used later to compute transfer probabilities. These probabilities are then used in the McRaptor implementation to choose the best trips according not only to their time but also to their __probability of success__.

#### Work from translation tables

We use the data generated in `hdfs_match_datasets.ipynb`, which matches trip_id between the _timetable_ and _sbb_ datasets. We begin by looking at all trip_id found in both datasets with at least 5 stations in common.

Our goal is to find a match in the _sbb_ dataset for every _timetable_ trip (and not the other way around), so we focus on building this asymmetrical correspondence table.

To do so, we need multiple joins, since three tables are involved: the _sbb_ data, which contains the delay information; the `joined_trip_atL5_3` table, which contains the trip_id translation between the two datasets; and `stop_time`, which contains all the unique stop_id x trip_id pairs used in later steps (see the sketch at the end of this section).
- First, we join the _sbb_ data `sbb_filt_forDelays_GeschaetzAndReal_2` with the translation table `joined_trip_atL5_3`, to annotate the sbb data with the _timetable_ trip_id.
- We can then use this _timetable_ trip_id to join the result with the `stop_time` table, using a _left_outer_ join, so that we get an idea of how many matches are found overall.

First we load the SBB data. The following cells were run twice: once for `geschaetz` / `real` delays only, and once for `all` delays.
- `geschaetz` / `real`: load and use the `/user/{}/sbb_filt_forDelays_GeschaetzAndReal_2.orc` table
- `all`: load and use the `/user/{}/sbb_filt_forDelays_AllDelays.orc` table
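Here is a minimal PySpark sketch of the two joins above, assuming the three tables are already loaded as DataFrames; the join keys and column names (`FAHRT_BEZEICHNER`, `trip_id`, `stop_id`, `bpuic`) are plausible assumptions, not necessarily the notebook's exact schema.

```python
from pyspark.sql import functions as F

# 1) Annotate the sbb delay data with the timetable trip_id via the
#    translation table.
sbb_with_trip = sbb_filt_forDelays_GeschaetzAndReal_2.join(
    joined_trip_atL5_3, on="FAHRT_BEZEICHNER", how="inner"
)

# 2) Left-outer join from stop_time, so every unique (trip_id, stop_id)
#    pair is kept; unmatched pairs come out as nulls, giving a direct
#    measure of overall match coverage. We assume stop_id here is already
#    the 7-character, sbb-compatible form.
coverage = stop_time.join(
    sbb_with_trip.withColumnRenamed("bpuic", "stop_id"),
    on=["trip_id", "stop_id"],
    how="left_outer",
)

# Fraction of (trip_id, stop_id) pairs with at least one sbb match.
coverage.select(
    (F.count("FAHRT_BEZEICHNER") / F.count("*")).alias("match_rate")
).show()
```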
### Compute probability of transfer success from delays distributions

_Note: this section summarizes `proba_functions.ipynb`, which is run locally._

To compute the probability of success of a given transfer, we compare the arrival delay distribution with the departure of the next trip. This requires a delay distribution for each trip arrival at a given station. We then use a __cumulative distribution function__ to compute $P(T \leq t)$:

$$F_{T}(t)=\operatorname{P}(T\leq t)=\sum_{t_{i}\leq t}\operatorname{P}(T=t_{i})=\sum_{t_{i}\leq t}p(t_{i}).$$

The strategy is to rely entirely on past data to compute $p(t_i)$, without building a model, which would imply making additional assumptions. If we have enough data for a given transfer with known trip_id x stop_id, we use the above formula and compute each $p(t_i)$ simply as:

$$p(t_i) = \frac{x_i}{\sum_j x_j}$$

with $x_i$ the number of delays of duration $t_i$ observed in the SBB dataset. For instance, if delays of 0, 1 and 2 minutes were observed 60, 30 and 10 times, then $p(1) = 30/100 = 0.3$ and $F_T(1) = P(T \leq 1) = 0.9$.

We make a few __assumptions__:
- If we have less than 2 minutes for the transfer, we miss it.
- The next train leaves on time.

#### Recover missing data

Whenever we cannot find a clear match for a given `trip_id` x `stop_id`, we use aggregated delay distributions from similar transfers, to which we apply the same CDF computation as above.

To recover missing or faulty data, the strategy is the following:

1. If we have more than 100 data points in the `real` group, we rely exclusively on its delay distribution to compute the probabilities for a given transfer on a `trip_id x stop_id`.

   _Note: the `real` group corresponds to arrival times with status `geschaetz` or `real`, meaning they come from actual measurements._

2. If we do not find enough data within the `real` group, we use the delay distributions of the `all` group (which contains all delays, including those with `prognose` status), provided there are more than 100 data points for the given `trip_id x stop_id`.

3. If the `all` group still does not have more than 100 data points, we rely on `recovery tables` to estimate the delay distributions:
   - Since we always know the `stop_id`, the `time` and the `transport_type`, we rely on arrival delays aggregated over similar transfers.
   - First, we compute a distribution table over all possible combinations of `stop_id`, `time` (rounded to the hour) and `transport_type`, and aggregate all the counts we have to compute cumulative distribution probabilities.
   - If one of these intersections has fewer than 100 data points, we use the last possibility: a table of counts aggregated by `transport_type` x `time`.
   - The last values with no match are given the overall average of the cumulative distribution probabilities for their `transport_type`, with no minimum number of data points.

Following this approach, we can find cumulative distribution probabilities for every `trip_id x stop_id` combination defined in `stop_times_df`. We build a table with the same row order so that McRaptor can easily find their indexes.

#### Evaluate / Validate recovery tables

The question is: how precise are these recovery-table-derived probabilities? Which one should be used in priority?

To add ...
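Whichever validation is chosen, the quantity being compared is the transfer-success probability computed from a delay distribution, whether exact or recovered. As a reference for that comparison, here is a minimal Python sketch of that computation under the assumptions stated above (2-minute minimum transfer, next train on time); the function and variable names are hypothetical, not the notebook's actual API.

```python
def transfer_success_proba(delay_counts, buffer_minutes, min_transfer=2.0):
    """P(transfer succeeds) = P(arrival delay <= buffer - min_transfer).

    delay_counts: dict mapping arrival delay in minutes -> observed count x_i.
    buffer_minutes: scheduled gap between arrival and the next departure,
    assuming the next train leaves on time.
    """
    slack = buffer_minutes - min_transfer
    if slack < 0:
        return 0.0  # less than 2 minutes to transfer: we miss it
    total = sum(delay_counts.values())
    # Empirical CDF: F_T(t) = sum of x_i / total over all t_i <= t.
    made = sum(count for delay, count in delay_counts.items() if delay <= slack)
    return made / total

# Example: delays of 0, 1, 2 and 5 minutes observed 60, 30, 5 and 5 times.
counts = {0: 60, 1: 30, 2: 5, 5: 5}
print(transfer_success_proba(counts, buffer_minutes=4))  # P(delay <= 2) = 0.95
```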