<h4 id="Instructions">Instructions<a class="anchor-link" href="#Instructions">¶</a></h4><p><em>This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.</em></p>
<p><em>Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.</em></p>
<pre>{'courseId': 'MSE-440', 'name': 'Composites technology', 'description': "The latest developments in processing and the novel generations of organic composites are discussed. Nanocomposites, adaptive composites and biocomposites are presented. Product development, cost analysis and study of new markets are practiced in team work. Content Basics of composite materialsConstituentsProcessing of compositesDesign of composite structures\xa0Current developmentNanocomposites Textile compositesBiocompositesAdaptive composites\xa0ApplicationsDriving forces and marketsCost analysisAerospaceAutomotiveSport Keywords Composites - Applications - Nanocomposites - Biocomposites - Adaptive composites - Design - Cost Learning Prerequisites Required courses Notion of polymers Recommended courses Polymer Composites Learning Outcomes By the end of the course, the student must be able to: Propose suitable design, production and performance criteria for the production of a composite partApply the basic equations for process and mechanical properties modelling for composite materialsDiscuss the main types of composite applications Transversal skills Use a work methodology appropriate to the task.Use both general and domain specific IT resources and toolsCommunicate effectively with professionals from other disciplines.Evaluate one's own performance in the team, receive and respond appropriately to feedback. Teaching methods Ex cathedra and invited speakers Group sessions with exercises or work on the project Expected student activities Attendance at lectures Design of a composite part, bibliography search \xa0 Assessment methods Written exam report and oral presentation in class"}
<h2 id="Exercise-4.1:-Pre-processing">Exercise 4.1: Pre-processing<a class="anchor-link" href="#Exercise-4.1:-Pre-processing">¶</a></h2><p>Pre-process the corpus to create bag-of-words representations of each document. You are free
to proceed as you wish.</p>
<p>We use the following approaches:</p>
<ul>
<li>Remove the stopwords.</li>
<li>Remove the punctuation.</li>
<li>Remove the very frequent words.</li>
<li>Remove the very infrequent words.</li>
<li>Stem the words (look for stemming online).</li>
<li>Lemmatise the words (look for lemmatization online).</li>
<li>Add bigrams.</li>
</ul>
<ol>
<li>Explain which ones you implemented and why.</li>
<li>Print the terms in the pre-processed description of the IX class in alphabetical order.</li>
</ol>
<hr>
<h4 id="Remove-stopwords-and-punctuation">Remove stopwords and punctuation<a class="anchor-link" href="#Remove-stopwords-and-punctuation">¶</a></h4><p><strong>Why:</strong> Punctuation only helps humans to structure and read the sentences but does not carry useful information about the topic. Thus, we can remove it. Stopwords are similar, as they represent the most common words in a language and usually only help connect and build a sentnce without adding meaning to it (e.g. think of articles (a,the..) or connectors (and,or, hence..) ). So stopwords are also removed.</p>
<span class="n">line</span> <span class="o">=</span> <span class="n">digit_split</span><span class="p">(</span><span class="n">line</span><span class="p">)</span> <span class="c1">#split words concatinated with a number (without space inbetween)</span>
<h4 id="Remove-the-very-frequent-words">Remove the very frequent words<a class="anchor-link" href="#Remove-the-very-frequent-words">¶</a></h4><p><strong>Why:</strong> Below we can see the word distribution of the 35 most frequent words over all documents (not cumulated). This is a power law distribution. That means that the most common words are used much more often than the other words. Using a word too often makes a word useless for topic classification as it will be part of any topic. The most frequent words (methods, learning, student) prove this point. Of course they are used often in course descriptions because the words are closesly connected to university but they are too broad to indicate a specific topic or field of studies. Therefore, we remove the most frequent words.</p>
<p>The plot below shows the most frequent words. We decided to delete the first ten, as words like <em>methods</em>, <em>students</em> and <em>end</em> seem very vague. The words after that (<em>assessment</em>, <em>outcomes</em>, <em>prerequisites</em>) also seem very general, but they might already carry some information; for example, some courses may list more prerequisites than others.</p>
<h4 id="Remove-the-very-infrequent-words">Remove the very infrequent words<a class="anchor-link" href="#Remove-the-very-infrequent-words">¶</a></h4><p><strong>Why:</strong> Below we can see words that only appear once over all topics. We notice that apart from very special terms many of those words are misspellings of more frequent words. It is quite hard to find and correct all misspellings. As the words are so infrequent they do not really carry any meaning and should thus be removed.</p>
<h4 id="Remove-most-and-least-frequent-words">Remove most and least frequent words<a class="anchor-link" href="#Remove-most-and-least-frequent-words">¶</a></h4>
<h4 id="Stemming-vs-Lemmatization">Stemming vs Lemmatization<a class="anchor-link" href="#Stemming-vs-Lemmatization">¶</a></h4><p>The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.</p>
<p>Stemming removes affixes from a word and returns the root (stem).
Lemmatization is a similar approach but relies on a vocabulary and a morphological analysis of words; it returns the base or dictionary form of a word (the lemma).</p>
<p>So the difference is that stemming can create non-existent words, whereas lemmas are actual words.</p>
<p>Thus, lemmatization is probably more accurate, as it works with knowledge of the language. We try both methods separately so that we can use either lemmas or stems in the later analysis and compare the results.</p>
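<p>A sketch of both variants using NLTK, which is one common choice (the WordNet lemmatizer needs <code>nltk.download('wordnet')</code>):</p>
<pre>
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stems  = [[stemmer.stem(w) for w in doc] for doc in docs]
lemmas = [[lemmatizer.lemmatize(w) for w in doc] for doc in docs]

# e.g. 'studies' -> stem 'studi' (not a word), lemma 'study' (dictionary form)
</pre>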
<h4 id="Add-bigrams">Add bigrams<a class="anchor-link" href="#Add-bigrams">¶</a></h4><p><strong>Why:</strong> Even though each single word always provides some meaning, groups of words could give more information. That is because often concepts or ideas are not only described in one word. (e.g. social network). So we hope to win some precision by using bigrams on the lemmas.</p>
'description': "The latest developments in processing and the novel generations of organic composites are discussed. Nanocomposites, adaptive composites and biocomposites are presented. Product development, cost analysis and study of new markets are practiced in team work. Content Basics of composite materialsConstituentsProcessing of compositesDesign of composite structures\xa0Current developmentNanocomposites Textile compositesBiocompositesAdaptive composites\xa0ApplicationsDriving forces and marketsCost analysisAerospaceAutomotiveSport Keywords Composites - Applications - Nanocomposites - Biocomposites - Adaptive composites - Design - Cost Learning Prerequisites Required courses Notion of polymers Recommended courses Polymer Composites Learning Outcomes By the end of the course, the student must be able to: Propose suitable design, production and performance criteria for the production of a composite partApply the basic equations for process and mechanical properties modelling for composite materialsDiscuss the main types of composite applications Transversal skills Use a work methodology appropriate to the task.Use both general and domain specific IT resources and toolsCommunicate effectively with professionals from other disciplines.Evaluate one's own performance in the team, receive and respond appropriately to feedback. Teaching methods Ex cathedra and invited speakers Group sessions with exercises or work on the project Expected student activities Attendance at lectures Design of a composite part, bibliography search \xa0 Assessment methods Written exam report and oral presentation in class",
<h4 id="2.--Print-the-terms-in-the-pre-processed-description-of-the-IX-class-in-alphabetical-order.">2. Print the terms in the pre-processed description of the IX class in alphabetical order.<a class="anchor-link" href="#2.--Print-the-terms-in-the-pre-processed-description-of-the-IX-class-in-alphabetical-order.">¶</a></h4><p>The fisrt array shows the soted stems the second one the sorted lemmas, the last one the sorted bigrams</p>
<h2 id="Exercise-4.2:-Term-document-matrix">Exercise 4.2: Term-document matrix<a class="anchor-link" href="#Exercise-4.2:-Term-document-matrix">¶</a></h2><p>Construct an M×N term-document matrix X, where M is the number of terms and N is thenumber of documents. The matrix X should besparse. You are not allowed to use libraries for this task (i.e., the computation of TF-IDF must be implemented by you.)</p>
<ol>
<li>Print the 15 terms in the description of the IX class with the highest TF-IDF scores.</li>
<li>Explain where the difference between the large scores and the small ones comes from.</li>
</ol>
<p><em>Hint: It is useful for this exercise and the rest of the lab to keep track of the mapping between
terms and their indices, and documents and their indices.</em></p>
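<p>A sketch of one way to build the sparse matrix with <code>scipy.sparse</code>, where <code>docs</code> is assumed to hold one of the token representations (stems, lemmas, or bigrams) and the raw-count tf variant is an assumption:</p>
<pre>
import numpy as np
from collections import Counter
from scipy.sparse import csr_matrix

terms = sorted({w for doc in docs for w in doc})
term_index = {t: i for i, t in enumerate(terms)}     # term -> row index

M, N = len(terms), len(docs)
rows, cols, vals = [], [], []
df = np.zeros(M)                                     # document frequencies

for j, doc in enumerate(docs):
    for t, c in Counter(doc).items():
        i = term_index[t]
        rows.append(i); cols.append(j); vals.append(c)
        df[i] += 1

# tf-idf(t, d) = tf(t, d) * log(N / df(t))
tf = csr_matrix((vals, (rows, cols)), shape=(M, N))
idf = np.log(N / df)
X = csr_matrix(tf.multiply(idf[:, None]))

# top 15 terms for the IX course (column ix), sorted by score:
# sorted(zip(terms, X[:, ix].toarray().ravel()), key=lambda p: -p[1])[:15]
</pre>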
<p>We can see that the stems and lemmas with the highest scores are rather similar; only their order differs slightly.</p>
<p>The bigrams show similar words to the methods above, except each is followed by another word that seems to add information. Computing on bigrams took a little longer.</p>
<h4 id="2.-Explain-where-the-difference-between-the-large-scores-and-the-small-ones-comes-from">2. Explain where the difference between the large scores and the small ones comes from<a class="anchor-link" href="#2.-Explain-where-the-difference-between-the-large-scores-and-the-small-ones-comes-from">¶</a></h4><p>The TFIDF value indicates how often a word appears in a document combined with its appearance over all documents. So the words with high scores appear often in the Internet Analytics description while they appear rarely in other course desciptions.<br>
Words with the lowest scores are very common in the course descriptions but can only be found rarely (or not at all) in the Internet Analytics description.</p>
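<p>For reference, assuming the standard raw-count variant, the score is $\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \log \frac{N}{\mathrm{df}(t)}$, where $\mathrm{tf}(t,d)$ counts the occurrences of term $t$ in document $d$, $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$. A large score therefore requires both a high term frequency and a low document frequency.</p>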
<span class="nb">print</span><span class="p">(</span><span class="s2">"Top five matches for the query "</span><span class="p">,</span><span class="n">query_name</span><span class="p">,</span><span class="s2">" and the according similarity:"</span><span class="p">)</span>
<h4 id="1.--Display-the-top-five-courses-together-with-their-similarity-score-for-each-query.">1. Display the top five courses together with their similarity score for each query.<a class="anchor-link" href="#1.--Display-the-top-five-courses-together-with-their-similarity-score-for-each-query.">¶</a></h4><p>Displaying for TDIDF using stems and lemmas for comparison</p>
<p><strong>Lemmas vs stems:</strong> We can see that both methods produce the same top 5 matches for both queries, in the same order and with only small differences in similarity. So the methods seem to work equally well.</p>
<p><strong>Bigrams:</strong> Interestingly, the bigrams return a slightly different result. While the first three courses for the Markov chain query are similar, the last two are not. That is because the bigram model only matches the combination of both words, whereas the unigram models look for <em>chain</em> and <em>markov</em> separately; supply-chain courses, which mention chains a lot, therefore end up in their top 5. Also, the overall similarities are smaller than for the models above.<br>
So it seems that bigrams might perform a little better.</p>
<h4 id="2.--What-do-you-think-of-the-results?-Give-your-intuition-on-what-is-happening.">2. What do you think of the results? Give your intuition on what is happening.<a class="anchor-link" href="#2.--What-do-you-think-of-the-results?-Give-your-intuition-on-what-is-happening.">¶</a></h4><p>We observe that while the first query gives good results, the second query 'facebook' only returns one document with a non-zero 'similarity'. That is because the word 'facebook' does only occur in that one course description but not in others. So we are not be able to give any relevance for other documents, even though they might contain similar phrases like 'social media'.</p>
<p>This shows a disadvantage of vector space retrieval models: they cannot generalize the concepts behind the search terms and fail if the specific word cannot be found in a document.</p>