1. Transform the GTFS files for the Swiss transport network to GTFS-like files `stop_times` and `transfers` by running the [data wrangling before RAPTOR notebook](notebooks/data_wrangling_before_RAPTOR.ipynb) in its entirety.
5. Generate dictionaries of delay distributions from HDFS and save them locally as pickles, `data/d_all.pck.gz` and `data/d_real.pck.gz`, by running the [hdfs_get_distributions notebook](notebooks/hdfs_get_distributions.ipynb).
We formatted the GTFS file `stop_times.txt` to a cleaned GTFS-like `stop_times` table which directly corresponds to the StopTimes array at the core of RAPTOR. To ensure a coherent file structure within RAPTOR, trips and routes were reconstructed directly from the cleaned `stop_times` table. In-depth explanations of the data wrangling process to model the transport network in RAPTOR can be found in the [data wrangling before RAPTOR notebook](notebooks/data_wrangling_before_RAPTOR.ipynb) and the [generating arrays for RAPTOR notebook](notebooks/generating_arrays_for_RAPTOR.ipynb).
One of the inputs of our McRaptor implementation is a 2-dimensional array of pre-computed probabilities, derived from past delays in the _sbb_ dataset. We pre-computed the probability of success for every combination of `trip_id` and `stop_id` (~250k combinations for stops within a 15 km perimeter around Zurich HB). These are then used in the McRaptor implementation to choose the best trip based not only on the shortest time but also on its __probability of success__.
Our aim is to find a distribution for each `trip_id` x `stop_id`, even if the `trip_id` cannot be translated between datasets.
__stop_id:__ _timetable_ `stop_id` may contain additional information about the platform, and therefore needs to be trimmed to its first 7 characters to match the _sbb_ `bpuic` id.
__trip_id:__ To match both datasets, we joined them on `stop_id`, `departure_time` and `arrival_time` to get the corresponding `trip_id` in each dataset. The idea is to keep every `trip_id` with more than X matches between both datasets.
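The matching described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names `trim_stop_id` and `match_trip_ids`, the row layout `(trip_id, stop_id, arrival_time, departure_time)`, the `min_matches` parameter (standing in for X) and the sample rows are all hypothetical, and the notebooks may well perform the same join on dataframes instead.

```python
from collections import Counter


def trim_stop_id(stop_id):
    """Keep the first 7 characters so that any platform suffix is dropped."""
    return stop_id[:7]


def match_trip_ids(timetable_rows, sbb_rows, min_matches):
    """Count stop events shared by each (timetable trip, sbb trip) pair.

    Rows are (trip_id, stop_id, arrival_time, departure_time) tuples
    (hypothetical layout). Only pairs with more than `min_matches`
    shared stop events are returned, together with their match count.
    """
    # Index sbb trips by the join key (stop, arrival, departure).
    sbb_index = {}
    for trip_id, bpuic, arrival, departure in sbb_rows:
        sbb_index.setdefault((bpuic, arrival, departure), []).append(trip_id)

    # Look up every timetable stop event in that index (an inner join).
    pair_counts = Counter()
    for trip_id, stop_id, arrival, departure in timetable_rows:
        key = (trim_stop_id(stop_id), arrival, departure)
        for sbb_trip_id in sbb_index.get(key, []):
            pair_counts[(trip_id, sbb_trip_id)] += 1

    return {pair: n for pair, n in pair_counts.items() if n > min_matches}


# Made-up sample rows, for illustration only.
timetable_rows = [
    ("T1", "8503000:0:5", "08:00:00", "08:02:00"),
    ("T1", "8503006", "08:10:00", "08:11:00"),
    ("T2", "8503000:0:7", "09:00:00", "09:01:00"),
]
sbb_rows = [
    ("S9", "8503000", "08:00:00", "08:02:00"),
    ("S9", "8503006", "08:10:00", "08:11:00"),
    ("S4", "8503000", "09:00:00", "09:01:00"),
]
matches = match_trip_ids(timetable_rows, sbb_rows, min_matches=1)
```

With these sample rows, only the pair `("T1", "S9")` shares more than one stop event, so it is the only pair kept.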
Using data generated in `hdfs_match_datasets.ipynb`, we can compute arrival delays from the _sbb_ dataset and match them with the _timetable_ `trip_id`. For each given `trip_id` x `stop_id`, we generate an array of delays from -1 (contains all trips ahead of schedule) to +30 (also contains trips more than 30 minutes late).
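A minimal sketch of such a delay array is given below. It assumes one bin per whole minute (32 bins in total) and that fractional delays are rounded down; both are assumptions for illustration, as are the names `delay_bin` and `delay_histogram`.

```python
import math

FIRST_BIN = -1  # all trips ahead of schedule
LAST_BIN = 30   # all trips 30 minutes late or more


def delay_bin(delay_minutes):
    """Map a delay in minutes to an array index between 0 and 31."""
    minute = math.floor(delay_minutes)               # assumption: round down
    clipped = max(FIRST_BIN, min(LAST_BIN, minute))  # fold both tails into the edge bins
    return clipped - FIRST_BIN


def delay_histogram(delays):
    """Count delays per bin for one trip_id x stop_id combination."""
    counts = [0] * (LAST_BIN - FIRST_BIN + 1)
    for delay in delays:
        counts[delay_bin(delay)] += 1
    return counts


# Made-up delays in minutes, for illustration only.
counts = delay_histogram([-5, -0.5, 0, 0.9, 3.2, 30, 45])
```

Here `counts[0]` holds the two early arrivals and `counts[31]` holds both the 30-minute and the 45-minute delay.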
We generate two tables of arrival delay distributions per `trip_id` x `stop_id`: one for `geschaetz` / `real` delays only, and one for `all` delays.
- `geschaetz` / `real`: comes from actual measurements.
- `all`: includes all kinds of arrival times, including those with `prognose` status, which means they were estimated. In any case, we assume this is better than estimating them ourselves.
Delays in the `geschaetz/real` group are used in priority if there is enough data; otherwise, delays including `prognose` status may be used instead.
Whenever we cannot find a clear match for a given `trip_id` x `stop_id`, we use aggregated delay distributions from similar transfers, to which we apply the same CDF function mentioned above.
1. If we have more than 100 data points in the `real` group, we rely exclusively on its delay distribution to compute probabilities for a given transfer on a `trip_id x stop_id`.
   _Note: the `real` group corresponds to arrival times with status `geschaetz` or `real`, meaning they come from actual measurements._
2. If we do not find enough data within the `real` group, we use the delay distributions in the `all` group (contains all delays, including `prognose` status), provided there are more than 100 data points for the given `trip_id x stop_id`.
3. If the `all` group still does not have more than 100 data points, we rely on `recovery tables` to estimate delay distributions. The strategy is the following:
   - As we will always know the `stop_id`, the `time` and the `transport_type`, we rely on arrival delays aggregated over similar transfers.
   - First, we compute a table of distributions for all possible combinations of `stop_id`, `time` (rounded to the hour) and `transport_type`, and aggregate all the counts we have to compute cumulative distribution probabilities.
   - If there are fewer than 100 data points in one of these intersections, we use the last possibility: a table of counts aggregated by `transport_type` x `time`.
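The fallback order above can be sketched as follows. This is a rough illustration rather than the notebook code: the function names, the dictionary layouts and the sample counts are hypothetical, and the threshold is read as "strictly more than 100 data points" at every level.

```python
MIN_POINTS = 100


def choose_counts(trip_stop, stop_hour_type, hour_type,
                  d_real, d_all, recovery_stop, recovery_type):
    """Return (source, counts) from the first table holding enough data.

    d_real and d_all are keyed by (trip_id, stop_id), recovery_stop by
    (stop_id, hour, transport_type) and recovery_type by
    (hour, transport_type). All of these layouts are assumptions.
    """
    candidates = [
        ("real", d_real.get(trip_stop)),
        ("all", d_all.get(trip_stop)),
        ("stop_id x time x transport_type", recovery_stop.get(stop_hour_type)),
    ]
    for source, counts in candidates:
        if counts is not None and sum(counts) > MIN_POINTS:
            return source, counts
    # Last resort: counts aggregated by transport type and hour only.
    return "transport_type x time", recovery_type[hour_type]


def cumulative_probabilities(counts):
    """Turn delay counts into cumulative distribution probabilities."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot build a distribution from zero data points")
    cdf, running = [], 0
    for count in counts:
        running += count
        cdf.append(running / total)
    return cdf


# Made-up tables with three delay bins each, for illustration only.
d_real = {("T1", "A"): [60, 50, 10]}
d_all = {("T2", "A"): [100, 50, 50]}
recovery_stop = {("A", 8, "Bus"): [10, 5, 5]}
recovery_type = {(8, "Bus"): [500, 300, 200]}

source, counts = choose_counts(("T3", "A"), ("A", 8, "Bus"), (8, "Bus"),
                               d_real, d_all, recovery_stop, recovery_type)
cdf = cumulative_probabilities(counts)
```

In this example `("T3", "A")` has no entry in either trip-level table and the stop-level recovery table holds only 20 data points, so the lookup falls through to the `transport_type` x `time` table.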
Following this approach, we can find cumulative distribution probabilities for every combination of `trip_id x stop_id` as defined in `stop_times_df`. We build a table with the same row order so that McRaptor can easily find their indexes.
[Delling et al.](https://www.microsoft.com/en-us/research/wp-content/uploads/2012/01/raptor_alenex.pdf) (cf. complete citation above) give a pseudocode of the RAPTOR algorithm, which solves the earliest arrival problem while simultaneously optimizing for the lowest possible number of individual trips in a journey.
Instead of a single arrival/departure time per round, each stop can now have an arbitrary number of [Pareto optimal](https://en.wikipedia.org/wiki/Pareto_efficiency) solutions ("labels") per round.
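To illustrate what keeping a set of Pareto-optimal labels per stop involves, here is a minimal sketch. It assumes two criteria, arrival time (earlier is better) and probability of success (higher is better); the `Label` type and the function names are hypothetical and do not necessarily match the repository's implementation.

```python
from typing import NamedTuple


class Label(NamedTuple):
    arrival_time: int           # seconds since midnight (assumed unit)
    success_probability: float  # between 0 and 1


def dominates(a, b):
    """True if a is at least as good as b on both criteria and differs from b."""
    return (a.arrival_time <= b.arrival_time
            and a.success_probability >= b.success_probability
            and a != b)


def add_label(bag, new):
    """Return the bag of non-dominated labels after offering `new` to it."""
    if any(old == new or dominates(old, new) for old in bag):
        return bag  # new adds nothing: it is a duplicate or is dominated
    # Drop every label the newcomer dominates, then keep the newcomer.
    return [old for old in bag if not dominates(new, old)] + [new]


bag = []
bag = add_label(bag, Label(3600, 0.90))
bag = add_label(bag, Label(3500, 0.80))  # earlier but riskier: both are kept
bag = add_label(bag, Label(3700, 0.85))  # later and riskier: rejected
two_labels = list(bag)
bag = add_label(bag, Label(3400, 0.95))  # better on both criteria: replaces all
```

After the third call the bag still holds two labels, since neither is better on both criteria; the fourth label dominates both and becomes the only one left.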