\chapter{Data Extraction and Processing}
The main part of this project is to extract the information contained in the participant lists of UNFCCC meetings.
We explain in this chapter how we extracted and processed the data of the PDF lists.
\section{Data Extraction}
The first step consisted of transforming the available PDF files into text files. It is important to keep some information
about the structure of the text to be able to find the relevant data in the resulting text file.
\subsection{Raw Dataset}
We download the participant lists from the document webpage of the UNFCCC secretariat.
\cite{UNFCCC_docs}
The analyzed lists cover all COP meetings and almost all SB meetings. Note that an SB meeting is always held in parallel to a COP,
and that there is no separate participant list for it.
\\
A participant list has the following general structure: participants are listed under the affiliation they belong to. A member of the
Swiss government is, for example, listed under the affiliation ``Switzerland''. Each participant has a salutation that contains
at least ``Mr.'' or ``Ms.'', but may also contain titles such as ``H.E.'' (i.e. ``His Excellency'' or ``Her Excellency''). Some participants, but not all of them,
have a description that explains their role within the delegation, for example ``Minister of Foreign Affairs''.
Affiliations are sorted by affiliation category and then alphabetically. The possible affiliation categories are:
\begin{itemize}
    \item Parties
    \item Observer States
    \item United Nations Secretariat units and bodies
    \item Specialized agencies and related organizations
    \item Intergovernmental organizations
    \item Non-governmental organizations
\end{itemize}
The category ``Media'' exists for newer participant lists, but the corresponding participants are not listed. We therefore exclude this category.
\\
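As an illustration of the structure described above, the following minimal sketch splits a participant line into its salutation and name. The set of titles, the regular expression, the name \texttt{split\_salutation}, and the sample names are assumptions made for this example; they are not necessarily the rules used in the extraction script.

```python
import re
from typing import NamedTuple, Optional

# Titles mentioned in the text; real lists may contain more (assumption).
TITLES = ("H.E.", "Mr.", "Ms.")

# One or more titles at the start of the line, followed by the name.
_TITLE_ALTERNATIVES = "|".join(re.escape(t) for t in TITLES)
_SALUTATION_RE = re.compile(rf"^((?:(?:{_TITLE_ALTERNATIVES})\s*)+)(\S.*)$")


class Participant(NamedTuple):
    salutation: str
    name: str


def split_salutation(line: str) -> Optional[Participant]:
    """Return salutation and name, or None if the line has no salutation."""
    match = _SALUTATION_RE.match(line.strip())
    if match is None:
        return None
    # Normalize the whitespace between several titles to single spaces.
    salutation = " ".join(match.group(1).split())
    return Participant(salutation, match.group(2).strip())
```

A line without a salutation, such as a description line, yields \texttt{None}, which gives a simple way to tell participant lines from description lines.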
The format of the participant lists varies over time. For the first meetings, the participant lists are paper scans, which means that
we need to convert images to text. Furthermore, the manner in which affiliations are indicated varies: in the first meetings they are
always written in all-uppercase letters, which changed in later meetings.
\\
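For the early lists, the all-uppercase convention suggests a simple heuristic to tell affiliation lines from participant lines. The following sketch is one possible formulation; the function name \texttt{looks\_like\_affiliation} and the minimum number of letters are assumptions, and OCR noise may require looser rules in practice.

```python
def looks_like_affiliation(line: str, min_letters: int = 3) -> bool:
    """Heuristic: a line is an affiliation if all its letters are uppercase."""
    letters = [ch for ch in line if ch.isalpha()]
    # Require a few letters so that page numbers or stray marks do not match.
    if len(letters) < min_letters:
        return False
    return all(ch.isupper() for ch in letters)
```

Digits and punctuation are ignored, so an affiliation containing an apostrophe or a hyphen is still recognized.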
We choose the version of the participant lists that is published during the last days of a meeting.
We exclude the corrigenda, i.e. documents that are published later for some participant lists and contain corrections of the lists,
because their format varies a lot and many of the listed corrections are rather small
(change of order of participants within an affiliation, change of descriptions).
\subsection{Optical Character Recognition}
To extract the data from the scanned lists, we use Optical Character Recognition (OCR), more precisely Python-tesseract (pytesseract).
\cite{pytesseract}
Python-tesseract is a wrapper for the OCR engine Tesseract, which has been developed by Google since 2006, is open-source, and is
available under the Apache 2.0 license.
\\
% TODO first check what options I finally use, then describe how tesseract works.
% TODO change title
\subsection{Well-formatted PDF Extraction}
To extract the data from the well-formatted PDF files, we use a PDF processing package called pdfminer.six.
\cite{pdfminer.six}
This Python package is community-maintained and open-source.
Again, the main difficulty is to extract the text of the list in the correct order. This is especially difficult for documents with
three columns. For this reason, we adapted pdfminer.six by rewriting one of its classes, the \texttt{PDFPageAggregator}.
\\
First, we briefly explain how pdfminer.six extracts text from PDF files. It performs a layout analysis on every page before
extracting the text. This analysis is done in three stages:
\begin{itemize}
    \item Group characters into words and lines
    \item Group lines into boxes
    \item Group text boxes hierarchically
\end{itemize}
The output of the layout analysis is visualized in Figure~\ref{fig:pdfminer}.
\\
\begin{figure}[ht]
    \caption{Output of the layout analysis of pdfminer.six}
    \centering
    \includegraphics[width=0.9\textwidth]{pdfminer.png}
    \label{fig:pdfminer}
\end{figure}
The class we want to modify, \texttt{PDFPageAggregator}, is responsible for outputting the text lines of a page in the determined order.
To be able to sort the text lines according to our rules later, we modify the function \texttt{receive\_layout} such that it outputs,
for each \texttt{LTTextLine}, the available x and y positions within the page. In the script that performs the extraction, we then define rules to
determine in which column a text line is situated.
\\
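A minimal sketch of such column rules is given below. It assumes that each text line comes with its bounding box in PDF coordinates (origin at the bottom-left corner of the page, as in pdfminer.six) and that all columns have the same width. The \texttt{Line} class, the function names, the sample values, and the equal-width rule are illustrative assumptions, not the actual rules of the extraction script.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """A text line with the position of its bounding box on the page."""
    text: str
    x0: float  # left edge
    y1: float  # top edge (PDF coordinates: larger y is higher on the page)


def column_of(line: Line, page_width: float, n_columns: int) -> int:
    """Return the zero-based column index, based on the left edge of the line."""
    column_width = page_width / n_columns
    index = int(line.x0 // column_width)
    # Clamp in case a line starts slightly outside the page.
    return max(0, min(index, n_columns - 1))


def reading_order(lines: List[Line], page_width: float, n_columns: int) -> List[Line]:
    """Sort lines column by column, and from top to bottom within a column."""
    return sorted(
        lines,
        key=lambda line: (column_of(line, page_width, n_columns), -line.y1),
    )


# Hypothetical page with three columns; the lines arrive in arbitrary order.
page = [
    Line("Mr. B", x0=220.0, y1=700.0),
    Line("Mr. A", x0=40.0, y1=650.0),
    Line("SWITZERLAND", x0=40.0, y1=700.0),
]
ordered = [line.text for line in reading_order(page, page_width=595.0, n_columns=3)]
```

Using only the left edge keeps the rule robust against long lines that extend close to the next column; real documents may need column boundaries measured from the page rather than derived from the page width.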
A special case for the page layout are affiliation category titles. They break the column system in the middle of a page. We therefore
need to recognize them by their content and introduce special rules for pages that contain affiliation category titles.
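
One way to handle such pages is sketched below: a category title is recognized by comparing the normalized text of a line with the known category names, and the page is then split at the vertical position of the title so that the parts above and below it can be sorted by columns separately. The function names, the representation of a line as a (text, top y) pair, and the sample lines are assumptions made for this illustration.

```python
from typing import List, Optional, Tuple

# Category names as they appear in the participant lists (lowercased).
CATEGORY_TITLES = {
    "parties",
    "observer states",
    "united nations secretariat units and bodies",
    "specialized agencies and related organizations",
    "intergovernmental organizations",
    "non-governmental organizations",
}

# A line is represented as (text, y of its top edge); larger y is higher up.
TextLine = Tuple[str, float]


def is_category_title(text: str) -> bool:
    """Compare the text with the known titles, ignoring case and extra spaces."""
    return " ".join(text.lower().split()) in CATEGORY_TITLES


def split_at_title(
    lines: List[TextLine],
) -> Tuple[List[TextLine], Optional[TextLine], List[TextLine]]:
    """Split a page into the lines above the title, the title, and the rest."""
    title = next((line for line in lines if is_category_title(line[0])), None)
    if title is None:
        return lines, None, []
    above = [line for line in lines if line[1] > title[1]]
    # Lines at the same height as the title are kept with the lower part.
    below = [line for line in lines if line[1] <= title[1] and line is not title]
    return above, title, below
```

This sketch handles at most one category title per page; a page with several titles would need the split to be applied repeatedly.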
\section{Data Processing}