SHS - Digital Humanities - Relations between Switzerland and Iran during the Cold War, using the American hostage crisis as an example
Recent Commits
Commit | Author | Details | Committed
---|---|---|---
585e4c354181 | cognet | IranIrak | Mar 6 2019
aa786cfacaa7 | bmean | add data | Mar 6 2019
81c3e1478233 | bmean | New change | Mar 6 2019
3541dbfbf0ab | bmean | new file | Mar 6 2019
359f5c76ead4 | bmean | add gitignore | Feb 27 2019
ccc0be5a9209 | bmean | add file | Feb 27 2019
README.md
epfl-shs-class
Instructions for using the data in the context of the EPFL SHS class
Part 1 - How to get the data
- Read and sign the NDA, and give it back to the teachers.
- Generate an SSH key, following these instructions, and send the content of the .ssh/id_rsa.pub file to the teachers.
Important: protect your key with a passphrase
- Once you receive the green light, download the data
- open a terminal and go to the directory where you want to download the data
- sftp impresso@dhlabsrv4.epfl.ch => connect to the server via sftp (more instructions here)
- cd sharespace => go to the folder where the data is
- ls -l => list what is there
- mget *.bz2 => get all files ending with .bz2
- exit => leave the server
Part 2 - How to transform it
NB: before reading further, install [jq](https://github.com/stedolan/jq/wiki/Installation), in case it's not yet installed on your system.
The data comes as bz2 archives. There is one archive per newspaper and year, and each contains newspaper articles that have been 'rebuilt' from the OCR output. The format is JSON Lines: each line is a JSON object, i.e. one article.
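Once you are in Python, the same archives can be read without decompressing them to disk. The following is a minimal sketch using only the standard library; the function name iter_articles is hypothetical and not part of this repository.

```python
import bz2
import json


def iter_articles(path):
    """Yield one dict per non-empty line of a bz2-compressed JSON Lines file."""
    # mode="rt" decompresses on the fly and decodes the bytes as text
    with bz2.open(path, mode="rt", encoding="utf-8") as stream:
        for line in stream:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)
```

Because the function is a generator, articles are processed one at a time, so a whole archive never has to fit in memory. A typical use would be `for article in iter_articles(path): ...` with `path` pointing at one of the downloaded archives.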
Each article contains more information than you need, so it is a good idea to filter it and keep only the fields that interest you. In the folder where you have the archives, execute the following command:
for f in *[0-9].jsonl.bz2; do bzcat "$f" | jq -c '{id: .id, type: .tp, date: .d, title: .t, fulltext: .ft}' | bzip2 > "${f%.jsonl.bz2}-reduced.jsonl.bz2" ; done
What the command does:
- iterate over the files whose name ends with a digit followed by the suffix .jsonl.bz2 (the current file name is held in the variable $f)
- open the archive (bzcat) and produce a stream of JSON
- send (pipe |) this stream into jq
- keep only the id, type, date, title and fulltext fields of each JSON object, renaming the short keys (tp, d, t, ft)
- compress the output (bzip2) and write it to a file whose name is that of the input file, completed with -reduced
From now on you will work with the -reduced.jsonl.bz2 archives. You can delete the others.
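If jq is not available, the same reduction can be approximated in Python. This is a sketch under assumptions, not the method used in this repository: the names FIELDS, reduce_article and reduce_archive are hypothetical, and only the key mapping is taken from the jq filter above.

```python
import bz2
import json

# Output key -> short key in the original archives (same mapping as the jq filter)
FIELDS = {"id": "id", "type": "tp", "date": "d", "title": "t", "fulltext": "ft"}


def reduce_article(article):
    """Keep only the fields of interest; missing keys become None (null in JSON)."""
    return {new: article.get(old) for new, old in FIELDS.items()}


def reduce_archive(src, dst):
    """Write a reduced copy of the archive `src` to `dst`; return the article count."""
    count = 0
    with bz2.open(src, "rt", encoding="utf-8") as fin, \
            bz2.open(dst, "wt", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue
            reduced = reduce_article(json.loads(line))
            # One compact JSON object per line, keeping accented characters readable
            fout.write(json.dumps(reduced, ensure_ascii=False) + "\n")
            count += 1
    return count
```

As with the shell loop, the output would be named after the input file with -reduced added, for example `reduce_archive("NAME.jsonl.bz2", "NAME-reduced.jsonl.bz2")`, where NAME stands for one of your archive names.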
Part 3 - Setting up your working environment
Python environment
- Download Anaconda in order to get the Conda environment manager.
- Familiarize yourself with Conda
- Open a terminal, go to your working directory and create an environment:
conda create -n NAME python=3.6 where NAME is the name you want to give to the environment (e.g. digital-history)
- Activate it:
source activate NAME
Useful commands (and more info here):
- conda info --envs => list your environments
- source deactivate => deactivate an environment
- conda remove --name NAME --all => remove the environment 'NAME'
Working with Jupyter notebook
What it is: see this tutorial
The Anaconda distribution ships with Jupyter. If the jupyter command is not available in your activated environment, install it with conda install jupyter.
To launch a notebook, just execute this in your activated env: jupyter notebook
Getting started with the data
We've put a Jupyter notebook in this repo (Example.ipynb) to give you an idea of where to start.
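As a possible first step in a notebook, the sketch below streams a reduced archive and keeps only the articles whose text mentions a keyword. It is independent of Example.ipynb; the function name articles_mentioning is hypothetical, and it assumes the fulltext key produced by the reduction step in Part 2.

```python
import bz2
import json


def articles_mentioning(path, keyword):
    """Yield the articles of a reduced archive whose fulltext contains `keyword`."""
    needle = keyword.lower()
    with bz2.open(path, "rt", encoding="utf-8") as stream:
        for line in stream:
            if not line.strip():
                continue
            article = json.loads(line)
            # fulltext may be null when the original article had no text
            text = article.get("fulltext") or ""
            if needle in text.lower():  # case-insensitive substring match
                yield article
```

For instance, `list(articles_mentioning(path, "Iran"))` would collect every article of one archive mentioning Iran. Note that this is a plain substring match: it will also match longer words containing the keyword, and it will miss occurrences garbled by the OCR.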
If you want to use Iramuteq, you will have to isolate the textual parts and print them as specified here.
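The sketch below shows one way to isolate the text of an article for Iramuteq. It assumes the usual Iramuteq corpus convention, where each text is introduced by a line starting with four asterisks followed by starred variables of the form *variable_modality; check this against the specification linked above before relying on it. The names _modality and to_iramuteq and the choice of variables (date, type) are hypothetical.

```python
import re


def _modality(value):
    """Reduce a value to letters and digits so it can be used as a modality."""
    cleaned = re.sub(r"[^0-9A-Za-z]", "", str(value or ""))
    return cleaned or "unknown"


def to_iramuteq(article):
    """Format one reduced article as a header line followed by its text."""
    # Assumed header convention: "**** *var_modality *var_modality"
    header = "**** *date_{} *type_{}".format(
        _modality(article.get("date")), _modality(article.get("type"))
    )
    # Collapse line breaks and repeated spaces so the body sits on one line
    body = " ".join((article.get("fulltext") or "").split())
    return header + "\n" + body + "\n"
```

Writing the result of to_iramuteq for every article into a single UTF-8 text file would then give a corpus file with one header line and one text line per article. Asterisks occurring inside the article text are not handled here and may need to be removed.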