Find the original assignment for the project in [Project_Description.md](Project_Description.md)

### Matching SBB and Timetable datasets

_Note: this section summarizes what is done in `hdfs_match_datasets.ipynb`._

Since we use the _timetable_ dataset to build routes and trips but the _sbb_ dataset to compute delays, we need a robust translation between the two datasets. Concretely, we need to translate `trip_id` and `stop_id` so that we can compute delay distributions for every `trip_id` x `stop_id` pair.

#### Get corresponding stop_id between two datasets

We first look at the station identifiers in the _timetable_ dataset. A stop_id can come in several formats, shown below:
- `8502186`: the base format identifying the stop itself, which matches the `bpuic` field in the _sbb_ dataset.

We will call the next three formats __special cases__ throughout the notebook:
- `8502186:0:1` or `8502186:0:2`: the individual platforms are separated by ":". A "platform" can also be a platform plus sectors (e.g. `8500010:0:7CD`).
- `8502186P`: all the stops share a common "parent" (e.g. `8500010P`).
- `8502186:0:Bfpl`: used by the RBS for rail replacement buses.

Source: [timetable cookbook](https://opentransportdata.swiss/en/cookbook/gtfs/), section stops.txt.

In the _sbb_ actual data, the equivalent of stop_id appears in the `bpuic` field, in the first format, i.e. the station without platform information. To get from the _timetable_ format to an _sbb_-compatible one, we therefore keep only the __first 7 characters__ of the _timetable_ stop_id.

#### Get corresponding trip_id between two datasets

In the _sbb_ dataset, trip ids are given by the `FAHRT_BEZEICHNER` field; in _timetable_, by `trip_id`. To match the two datasets, we join on `stop_id`, `departure_time` and `arrival_time`, which puts the corresponding trip ids on the same row. The idea is to keep every trip_id with at least X matches between the two datasets; we chose 2 as the minimum number of matches (see the sketch after the list below). _Note: with a threshold above 2 we were not able to get InterCity / InterRegio trains, which have very few stops inside the 15 km perimeter._

Depending on the number of matches found, we label each trip and use these labels to distinguish three different ways of building delay distributions:
- __One-to-one__, we find a single clear match: we use the distribution of delays on weekdays for the given trip/station_id, based on all past _sbb_ data.
- __One-to-many__, we find multiple matches: the matches are aggregated together in the final distribution table.
- __One-to-none__, we find no match for the trip_id between datasets: as described later, we use the delay distribution of similar trips (sharing stop_id, transport type and hour) to infer the delay.
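To make the two matching steps above concrete, here is a minimal PySpark sketch. The DataFrame and column names (`timetable_st` for the _timetable_ stop times, `sbb` for the _sbb_ actual data) are hypothetical placeholders, not the notebook's actual variable names.

```python
from pyspark.sql import functions as F

# Step 1: make the timetable stop_id sbb-compatible by keeping only the
# first 7 characters, which drops platform / parent / Bfpl suffixes.
timetable_st = timetable_st.withColumn("stop_id7", F.substring("stop_id", 1, 7))

# Step 2: join on stop and times so corresponding trip ids land on the
# same row, then keep trip_id pairs with at least 2 matches.
trip_matches = (
    timetable_st.join(
        sbb,
        (timetable_st["stop_id7"] == sbb["bpuic"])
        & (timetable_st["arrival_time"] == sbb["arrival_time"])
        & (timetable_st["departure_time"] == sbb["departure_time"]),
    )
    .groupBy("trip_id", "FAHRT_BEZEICHNER")
    .count()
    .filter(F.col("count") >= 2)  # threshold of 2 keeps IC/IR trains
)
```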
### Get Distributions of Delay Times per trip and station

_Note: this section summarizes `hdfs_get_distributions.ipynb`._

The goal of this chapter is to build a distribution of arrival delays for each station / trip_id pair, to be used later to compute transfer probabilities. These probabilities are then used in the McRaptor implementation to choose the best trips according not only to their time but also to their __probability of success__.

#### Work from translation tables

We use the data generated in `hdfs_match_datasets.ipynb`, which matches trip_id between the _timetable_ and _sbb_ datasets. We begin by looking at all trip_id found in both datasets with at least 5 stations in common.

Our goal is to find a match in the _sbb_ dataset for every _timetable_ trip (and not the other way around), so we focus on building this asymmetrical correspondence table.

To do so, we need multiple joins, since three tables are involved: the _sbb_ data, which contains the delay information; the `joined_trip_atL5_3` table, which contains the trip_id translation between the two datasets; and `stop_time`, which contains all the unique stop_id x trip_id pairs used in later steps (see the sketch at the end of this section).
- First, we join the _sbb_ data `sbb_filt_forDelays_GeschaetzAndReal_2` with the translation table `joined_trip_atL5_3`, to annotate the sbb data with the _timetable_ trip_id.
- We can then use this _timetable_ trip_id to join the result with the `stop_time` table, using a _left_outer_ join, so that we get an idea of how many matches are found overall.

First we load the SBB data. The following cells were run twice: once for `geschaetz` / `real` delays only, and once for `all` delays.
- `geschaetz` / `real`: load and use the `/user/{}/sbb_filt_forDelays_GeschaetzAndReal_2.orc` table
- `all`: load and use the `/user/{}/sbb_filt_forDelays_AllDelays.orc` table
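Here is a minimal PySpark sketch of the two joins above, assuming the three tables are already loaded as DataFrames; the join keys and column names (`FAHRT_BEZEICHNER`, `trip_id`, `stop_id`, `bpuic`) are plausible assumptions, not necessarily the notebook's exact schema.

```python
from pyspark.sql import functions as F

# 1) Annotate the sbb delay data with the timetable trip_id via the
#    translation table.
sbb_with_trip = sbb_filt_forDelays_GeschaetzAndReal_2.join(
    joined_trip_atL5_3, on="FAHRT_BEZEICHNER", how="inner"
)

# 2) Left-outer join from stop_time, so every unique (trip_id, stop_id)
#    pair is kept; unmatched pairs come out as nulls, giving a direct
#    measure of overall match coverage. We assume stop_id here is already
#    the 7-character, sbb-compatible form.
coverage = stop_time.join(
    sbb_with_trip.withColumnRenamed("bpuic", "stop_id"),
    on=["trip_id", "stop_id"],
    how="left_outer",
)

# Fraction of (trip_id, stop_id) pairs with at least one sbb match.
coverage.select(
    (F.count("FAHRT_BEZEICHNER") / F.count("*")).alias("match_rate")
).show()
```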
### Compute probability of transfer success from delays distributions

_Note: this section summarizes `proba_functions.ipynb`, which is run locally._

To compute the probability of success of a given transfer, we compare the arrival delay distribution with the departure of the next trip. This requires a delay distribution for each trip arrival at a given station. We then use a __cumulative distribution function__ to compute $P(T \leq t)$:

$$F_{T}(t)=\operatorname{P}(T\leq t)=\sum_{t_{i}\leq t}\operatorname{P}(T=t_{i})=\sum_{t_{i}\leq t}p(t_{i}).$$

The strategy is to rely entirely on past data to compute $p(t_i)$, without building a model, which would imply making additional assumptions. If we have enough data for a given transfer with known trip_id x stop_id, we use the above formula and compute each $p(t_i)$ simply as:

$$p(t_i) = \frac{x_i}{\sum_j x_j}$$

with $x_i$ the number of delays of duration $t_i$ observed in the SBB dataset. For instance, if delays of 0, 1 and 2 minutes were observed 60, 30 and 10 times, then $p(1) = 30/100 = 0.3$ and $F_T(1) = P(T \leq 1) = 0.9$.

We make a few __assumptions__:
- If we have less than 2 minutes for the transfer, we miss it.
- The next train leaves on time.

#### Recover missing data

Whenever we cannot find a clear match for a given `trip_id` x `stop_id`, we use aggregated delay distributions from similar transfers, to which we apply the same CDF computation as above.

To recover missing or faulty data, the strategy is the following:

1. If we have more than 100 data points in the `real` group, we rely exclusively on its delay distribution to compute the probabilities for a given transfer on a `trip_id x stop_id`.

   _Note: the `real` group corresponds to arrival times with status `geschaetz` or `real`, meaning they come from actual measurements._

2. If we do not find enough data within the `real` group, we use the delay distributions of the `all` group (which contains all delays, including those with `prognose` status), provided there are more than 100 data points for the given `trip_id x stop_id`.

3. If the `all` group still does not have more than 100 data points, we rely on `recovery tables` to estimate the delay distributions:
   - Since we always know the `stop_id`, the `time` and the `transport_type`, we rely on arrival delays aggregated over similar transfers.
   - First, we compute a distribution table over all possible combinations of `stop_id`, `time` (rounded to the hour) and `transport_type`, and aggregate all the counts we have to compute cumulative distribution probabilities.
   - If one of these intersections has fewer than 100 data points, we use the last possibility: a table of counts aggregated by `transport_type` x `time`.
   - The last values with no match are given the overall average of the cumulative distribution probabilities for their `transport_type`, with no minimum number of data points.

Following this approach, we can find cumulative distribution probabilities for every `trip_id x stop_id` combination defined in `stop_times_df`. We build a table with the same row order so that McRaptor can easily find their indexes.

#### Evaluate / Validate recovery tables

The question is: how precise are these recovery-table-derived probabilities? Which one should be used in priority?

To add ...
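Whichever validation is chosen, the quantity being compared is the transfer-success probability computed from a delay distribution, whether exact or recovered. As a reference for that comparison, here is a minimal Python sketch of that computation under the assumptions stated above (2-minute minimum transfer, next train on time); the function and variable names are hypothetical, not the notebook's actual API.

```python
def transfer_success_proba(delay_counts, buffer_minutes, min_transfer=2.0):
    """P(transfer succeeds) = P(arrival delay <= buffer - min_transfer).

    delay_counts: dict mapping arrival delay in minutes -> observed count x_i.
    buffer_minutes: scheduled gap between arrival and the next departure,
    assuming the next train leaves on time.
    """
    slack = buffer_minutes - min_transfer
    if slack < 0:
        return 0.0  # less than 2 minutes to transfer: we miss it
    total = sum(delay_counts.values())
    # Empirical CDF: F_T(t) = sum of x_i / total over all t_i <= t.
    made = sum(count for delay, count in delay_counts.items() if delay <= slack)
    return made / total

# Example: delays of 0, 1, 2 and 5 minutes observed 60, 30, 5 and 5 times.
counts = {0: 60, 1: 30, 2: 5, 5: 5}
print(transfer_success_proba(counts, buffer_minutes=4))  # P(delay <= 2) = 0.95
```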