Diffusion SHS_HUM_DIG (master)


SHS - Digital Humanities - Relations between Switzerland and Iran during the Cold War, with the American hostage crisis as an example

SHS_HUM_DIG (master)

.gitignore
.ipynb_checkpoints/
ExtractIran.ipynb
ExtractIran_pierre.ipynb
IranIrak.txt
README.md
class-outline.md
data/

Recent Commits

Commit	Author	Details	Committed
585e4c354181	cognet	IranIrak	Mar 6 2019
aa786cfacaa7	bmean	add data	Mar 6 2019
81c3e1478233	bmean	New change	Mar 6 2019
3541dbfbf0ab	bmean	new file	Mar 6 2019
359f5c76ead4	bmean	add gitignore	Feb 27 2019
ccc0be5a9209	bmean	add file	Feb 27 2019

README.md

epfl-shs-class

Instructions for using the data in the context of the EPFL SHS class

Part 1 - How to get the data

Read and sign the NDA, and give it back to the teachers.

Generate an SSH key, following these instructions, and send the content of the .ssh/id_rsa.pub file to the teachers.

Important: protect your key with a passphrase

Once you receive the green light, download the data:
- open a terminal and go to the directory where you want to download the data
- sftp impresso@dhlabsrv4.epfl.ch => connect to the server via sftp (more instructions here)
- cd sharespace => go to the folder where the data is
- ls -l => see what is there
- mget *.bz2 => get all files ending with .bz2
- exit => leave the server

Part 2 - How to transform it

NB: before reading further, install [jq](https://github.com/stedolan/jq/wiki/Installation), in case it's not yet installed on your system.

The data comes as bz2 archives. Each archive covers one newspaper for one year and contains newspaper articles that have been 'rebuilt' from the OCR output. The format is JSON Lines: each line is a JSON object, i.e. one article.
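To make the JSON Lines layout concrete, here is a minimal Python sketch that reads such an archive one article at a time, using only the standard library. The function name `read_articles` is made up for this illustration, and the sketch assumes the archives are UTF-8 encoded.

```python
import bz2
import json
from typing import Iterator


def read_articles(path: str) -> Iterator[dict]:
    """Yield one article (a dict) per non-empty line of a .jsonl.bz2 archive."""
    # Mode "rt" gives decompressed text, so the file can be read line by line
    # without loading the whole archive into memory.
    with bz2.open(path, mode="rt", encoding="utf-8") as stream:
        for line in stream:
            line = line.strip()
            if line:  # tolerate blank lines, should there be any
                yield json.loads(line)
```

For example, `next(read_articles("SOME-ARCHIVE.jsonl.bz2"))` would return the first article of an archive; the file name here is a placeholder, not an actual file from the server.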

Each article contains more information than you need, so it is a good idea to filter it and keep only the fields that interest you. In the folder where you have the archives, execute the following command:

for f in *[0-9].jsonl.bz2; do bzcat "$f" | jq -c '{id: .id, type: .tp, date: .d, title: .t, fulltext: .ft}' | bzip2 > "${f%.jsonl.bz2}-reduced.jsonl.bz2"; done

What the command does:

iterates over the files whose names end in a digit followed by .jsonl.bz2 (the current file name is held in the variable $f)
opens the archive (bzcat) and produces a stream of JSON
sends (pipe |) this stream into jq
keeps only the id, tp, d, t and ft fields of each article and renames them id, type, date, title and fulltext (-c writes one compact JSON object per line)
recompresses the output (bzip2) and writes it to a file whose name is the input file name with -reduced added before the .jsonl.bz2 suffix
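For readers more at home in Python than in jq, the filtering step can be sketched as follows. This is an illustration of what the jq expression does to a single line, not a replacement for the command; `FIELD_MAP`, `reduce_article` and `reduce_line` are names invented for this example.

```python
import json

# Readable name -> short field name in the archives, as in the jq filter.
FIELD_MAP = {"id": "id", "type": "tp", "date": "d", "title": "t", "fulltext": "ft"}


def reduce_article(article: dict) -> dict:
    """Keep only the mapped fields; a missing field becomes None (null in jq)."""
    return {new: article.get(old) for new, old in FIELD_MAP.items()}


def reduce_line(line: str) -> str:
    """Reduce one JSON line and return it in compact form, like `jq -c`."""
    reduced = reduce_article(json.loads(line))
    # No spaces after separators and no ASCII escaping, to mimic jq's output.
    return json.dumps(reduced, ensure_ascii=False, separators=(",", ":"))
```

Any field not listed in the mapping is dropped, which is where the size reduction comes from.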

From now on you will work with the -reduced.jsonl.bz2 archives. You can delete the others.
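Before deleting the original archives, it may be worth checking that each reduced archive holds as many articles as its source. A minimal sketch, assuming one article per line in both files; the helper names `count_lines` and `same_article_count` are hypothetical.

```python
import bz2


def count_lines(path: str) -> int:
    """Count the non-empty lines (articles) in a bz2-compressed text file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as stream:
        return sum(1 for line in stream if line.strip())


def same_article_count(original: str, reduced: str) -> bool:
    """True when both archives hold the same number of articles."""
    return count_lines(original) == count_lines(reduced)
```

If `same_article_count` returns False for a pair of files, the reduction was probably interrupted and is best rerun for that archive.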

Part 3 - Setting up your working environment

Python environment

Download Anaconda in order to get the Conda environment manager.
Familiarize yourself with Conda
Open a terminal, go to your working directory and create an environment:

conda create -n NAME python=3.6 where NAME is the name you want to give to the environment (e.g. digital-history)

Activate it:

source activate NAME (with conda 4.4 or later: conda activate NAME)

Useful commands (and more info here):

conda info --envs => list your environments
source deactivate => deactivate an env (with conda 4.4 or later: conda deactivate)
conda remove --name NAME --all => remove environment 'NAME'

Working with Jupyter notebook

What it is: see this tutorial

Jupyter ships with the Anaconda distribution. If the jupyter command is not available in a newly created environment, install it there with conda install jupyter.

To launch a notebook, just execute this in your activated env: jupyter notebook

Getting started with the data

We've put a Jupyter notebook in this repo (Example.ipynb) to give you an idea of where to start.
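As one possible first step in a notebook, the sketch below scans a reduced archive for articles mentioning a keyword. It is not taken from the example notebook; the function name `find_articles` is an assumption, and the field names `title` and `fulltext` are those produced by the reduction command in Part 2.

```python
import bz2
import json
from typing import Iterator


def find_articles(path: str, keyword: str) -> Iterator[dict]:
    """Yield reduced articles whose title or full text mentions `keyword`."""
    needle = keyword.lower()
    with bz2.open(path, mode="rt", encoding="utf-8") as stream:
        for line in stream:
            if not line.strip():
                continue
            article = json.loads(line)
            # Fields may be null in the data, hence the `or ""` fallback.
            text = (article.get("title") or "") + " " + (article.get("fulltext") or "")
            if needle in text.lower():  # simple case-insensitive substring match
                yield article
```

Calling `list(find_articles(path, "Iran"))` on a reduced archive would return the matching articles as dictionaries. Note that a plain substring match also catches longer words containing the keyword and misses OCR misspellings.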

If you want to use Iramuteq, you will have to isolate the textual parts and print them as specified here.
