<h4 id="Instructions">Instructions<a class="anchor-link" href="#Instructions">¶</a></h4><p><em>This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.</em></p>
<p><em>Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.</em></p>
<h2 id="Exercise-4.4:-Latent-semantic-indexing">Exercise 4.4: Latent semantic indexing<a class="anchor-link" href="#Exercise-4.4:-Latent-semantic-indexing">¶</a></h2><p>Apply SVD with K=300 to your term-document matrix X from the previous exercise.</p>
<ol>
<li>Describe the rows and columns of U and V, and the values of S.</li>
<h4 id="1.--Describe-the-rows-and-columns-of-U-and-V,-and-the-values-of-S.">1. Describe the rows and columns of U and V, and the values of S.<a class="anchor-link" href="#1.--Describe-the-rows-and-columns-of-U-and-V,-and-the-values-of-S.">¶</a></h4><p><strong>$U$: Mapping term to topic </strong></p>
<p>The $n$ rows of the matrix $U$ obtained from the SVD map terms to concepts, which in our case are topics. Each row corresponds to one term and each column to one topic, so the value $v_i$ in a row shows how strongly that term relates to topic $c_i$.</p>
<p><strong>$V^T$: Mapping course to topic</strong></p>
<p>Similarly, each of the $m$ columns of the matrix $V^T$ corresponds to one course and each row to one topic, so the values in a column show how strongly that course relates to each topic.</p>
<p><strong>$S$: Topic importance</strong></p>
<p>The singular values in $S$ show how important each concept or topic is. The larger the value, the more important the topic, so the topics with the largest singular values differentiate terms or documents best.</p>
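<p>The shapes described above can be checked on a small example. The following is a minimal sketch, assuming a dense toy matrix and NumPy's full SVD truncated to the first $K$ components; the helper name <code>truncated_svd</code> and the toy data are illustrative and not taken from the lab code. For a large sparse term-document matrix, a sparse solver such as <code>scipy.sparse.linalg.svds</code> would likely be a better fit.</p>

```python
import numpy as np

def truncated_svd(X, k):
    """Return the rank-k factors U (terms x k), S (k,), Vt (k x docs) of X."""
    # NumPy returns the singular values in descending order.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k, :]

# Toy term-document matrix: 5 terms (rows) x 4 documents (columns).
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
    [1.0, 0.0, 0.0, 1.0],
])

U, S, Vt = truncated_svd(X, k=2)
print(U.shape, S.shape, Vt.shape)  # (5, 2) (2,) (2, 4)
print(S)                           # singular values, largest first
```

<p>Each row of <code>U</code> is a term, each column of <code>Vt</code> is a document, and <code>S</code> holds one weight per topic.</p>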
<h4 id="2.--Print-the-top-20-eigenvalues-of-X">2. Print the top-20 eigenvalues of X<a class="anchor-link" href="#2.--Print-the-top-20-eigenvalues-of-X">¶</a></h4>
<h2 id="Exercise-4.5:-Topic-extraction">Exercise 4.5: Topic extraction<a class="anchor-link" href="#Exercise-4.5:-Topic-extraction">¶</a></h2><p>Extract the topics from the term-document matrix X using the low-rank approximation.</p>
<ol>
<li>Print the top-10 topics as a combination of 10 terms and 10 documents.</li>
<li>Give a label to each of them</li>
</ol>
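<p>One possible way to print a topic as a combination of terms and documents is sketched below. It assumes that the terms and documents with the largest absolute weights in a column of $U$ and a row of $V^T$ describe the topic; ranking by absolute value is a choice made here because the sign of a singular vector is arbitrary. The vocabulary, course titles and the helper <code>top_items</code> are hypothetical toy values.</p>

```python
import numpy as np

def top_items(weights, labels, n):
    """Return the n labels with the largest absolute weight."""
    order = np.argsort(-np.abs(weights))[:n]
    return [labels[i] for i in order]

# Hypothetical toy vocabulary and course titles.
terms = ["laser", "optics", "market", "finance"]
courses = ["Course A", "Course B", "Course C", "Course D"]

# Toy term-document matrix (terms x courses) with two separate themes.
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Topics are ordered by singular value, so the first ones are the top topics.
for t in range(2):
    print(f"Topic {t + 1} (singular value {S[t]:.2f})")
    print("  terms:  ", top_items(U[:, t], terms, 2))
    print("  courses:", top_items(Vt[t, :], courses, 2))
```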
<h4 id="1.-Print-the-top-10-topics-as-a-combination-of-10-terms-and-10-documents.">1. Print the top-10 topics as a combination of 10 terms and 10 documents.<a class="anchor-link" href="#1.-Print-the-top-10-topics-as-a-combination-of-10-terms-and-10-documents.">¶</a></h4>
['Chemistry of food processes', 'Frontiers in Organic Synthesis. Part III Stereochemistry', 'Active Remote Sensing of the Atmosphere', 'Energy conversion and renewable energy', 'Topics in theoretical computer science', 'Applied wastewater engineering', 'Machine learning programming', 'Principles and Practicals in X-Ray Scattering', 'Creative Problem Solving in Science and Engineering', 'Advanced diffusional separation processes']
['Théorie et critique du projet MA1 (Huang)', 'Numerical methods in heat transfer', 'Printed systems and large area manufacturing', 'Derivatives', 'Risk and energy', '2D Layered Materials: Synthesis, Properties and Applications', 'Design of Ultra-low Power Wearable Wireless Systems', 'Quantitative methods in finance', 'Fundamentals in Systems Engineering', 'Solid waste engineering']
['Biomicroscopy I', 'Advanced MEMS', 'Biomicroscopy II', 'Optical detectors', 'Electron Microscopy for Life Science', 'Lasers: theory and modern applications', 'Image optics', 'Optics laboratories I', 'Printed systems and large area manufacturing', '2D Layered Materials: Synthesis, Properties and Applications']
['Signal processing for communications', 'Théorie et critique du projet MA2 (Huang)', 'Design of Ultra-low Power Wearable Wireless Systems', 'Image and video processing', 'Théorie et critique du projet MA1 (Huang)', 'Systems and architectures for signal processing', 'Digital Speech and Audio Coding', 'Solid waste engineering', 'Automatic speech processing', 'Speech processing']
['Printed systems and large area manufacturing', 'Théorie et critique du projet MA2 (Gugger)', 'Technology and Public Policy - (c) Technology, intellectual property and innovation policy', 'Théorie et critique du projet BA3 (De Vylder & Taillieu)', 'Risk and energy', 'Théorie et critique du projet MA1 (Gugger)', 'Théorie et critique du projet MA2 (Huang)', 'Théorie et critique du projet MA1 (Huang)', 'Théorie et critique du projet MA2 (Geers)', 'Théorie et critique du projet MA1 (Geers)']
['Integrated optics', 'Théorie et critique du projet MA2 (Geers)', 'Quantitative methods in finance', 'Théorie et critique du projet MA1 (Geers)', '2D Layered Materials: Synthesis, Properties and Applications', 'Optics laboratories I', 'Fracture Mechanics and Fatigue of Structures', 'Lasers: theory and modern applications', 'Structural stability', 'Advanced steel design']
['Microelectronics', 'Energy Autonomous Wireless Smart Systems', 'Advanced MEMS', 'Technology and Public Policy - (b) Technology, policy and regulation', '2D Layered Materials: Synthesis, Properties and Applications', 'Design of Ultra-low Power Wearable Wireless Systems', 'Risk and energy', 'Soft Microsystems Processing and Devices', 'Technology and Public Policy - (c) Technology, intellectual property and innovation policy', 'Printed systems and large area manufacturing']
<h4 id="2.--Give-a-label-to-each-of-them">2. Give a label to each of them<a class="anchor-link" href="#2.--Give-a-label-to-each-of-them">¶</a></h4><p><strong>Topic 1:</strong> life sciences + computation<br>
<h2 id="Exercise-4.6:-Document-similarity-search-in-concept-space">Exercise 4.6: Document similarity search in concept-space<a class="anchor-link" href="#Exercise-4.6:-Document-similarity-search-in-concept-space">¶</a></h2><p>Implement a search function using LSI concept-space, and search for "markov chains" and "facebook".</p>
<ol>
<li>Display the top five courses together with their similarity score for each query.</li>
<h4 id="1.--Display-the-top-five-courses-together-with-their-similarity-score-for-each-query.">1. Display the top five courses together with their similarity score for each query.<a class="anchor-link" href="#1.--Display-the-top-five-courses-together-with-their-similarity-score-for-each-query.">¶</a></h4>
<h4 id="2.--Compare-with-the-previous-section.">2. Compare with the previous section.<a class="anchor-link" href="#2.--Compare-with-the-previous-section.">¶</a></h4><p>Results from the previous section:</p>
<p>Top five matches for the query "markov chain" and their similarity scores:</p>
<ul>
<li>Applied probability & stochastic processes : 0.586117120443</li>
<li>Molecular and cellular biophysic II : 0.0</li>
</ul>
<p>For the "markov chain" query the results are very similar to those of the previous section; only the fifth document differs. For the "facebook" query the concept-space search performs much better. The previous section returned only the one course that has "facebook" in its description, whereas here we additionally get related courses as further search results.</p>
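<p>The behaviour described above can be reproduced on a toy example. The sketch below assumes one common convention: the query is projected into concept space as $U_K^T q$, documents are represented by the columns of $S_K V_K^T$, and the two are compared with cosine similarity. The function <code>lsi_search</code> and the toy vocabulary are hypothetical, and query preprocessing such as stemming is assumed to have happened already.</p>

```python
import numpy as np

def lsi_search(query_terms, terms, U_k, S_k, Vt_k, top_n=5):
    """Rank documents by cosine similarity to the query in concept space."""
    # Bag-of-words query vector over the vocabulary.
    q = np.array([float(t in query_terms) for t in terms])
    q_hat = U_k.T @ q              # query in concept space, shape (k,)
    docs = S_k[:, None] * Vt_k     # documents in concept space, shape (k, m)
    # Cosine similarity; the floor avoids division by zero for empty vectors.
    norms = np.linalg.norm(docs, axis=0) * np.linalg.norm(q_hat)
    scores = (q_hat @ docs) / np.maximum(norms, 1e-12)
    order = np.argsort(-scores)[:top_n]
    return [(int(j), float(scores[j])) for j in order]

# Toy data: only document 0 contains "facebook"; document 1 shares its other terms.
terms = ["facebook", "social", "network", "markov", "chain"]
X = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 1.0, 1.0],
])

U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
results = lsi_search({"facebook"}, terms, U[:, :k], S[:k], Vt[:k, :], top_n=4)
for doc, score in results:
    print(doc, round(score, 3))
```

<p>In this toy setting document 1 receives a high score even though it never mentions "facebook", because it lies along the same concept as document 0.</p>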
<h2 id="Exercise-4.7:-Document-document-similarity">Exercise 4.7: Document-document similarity<a class="anchor-link" href="#Exercise-4.7:-Document-document-similarity">¶</a></h2><p>Find the classes that are the most similar to Internet Analytics.</p>
<ol>
<li>Write down the equation to efficiently compute the similarity between documents.</li>
<li>Print the top 5 classes most similar to COM-308</li>
</ol>
<h4 id="1.--Write-down-the-equation-to-efficiently-compute-the-similarity-between-documents.">1. Write down the equation to efficiently compute the similarity between documents.<a class="anchor-link" href="#1.--Write-down-the-equation-to-efficiently-compute-the-similarity-between-documents.">¶</a></h4><p>Given two documents, we measure document-document similarity as the cosine similarity between their topic vectors. Let $v_i$ denote the topic vector of document $i$, which is column $i$ of the matrix $V^T$.
The similarity between documents $i$ and $j$ is then: $cosine\_sim(v_i,v_j) = \frac{v_i \cdot v_j}{\|v_i\| \, \|v_j\|}$</p>
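<p>To get the cosine similarity between one document and all others, we apply this formula to its topic vector and every other topic vector, i.e. every other column of $V^T$. If the columns are normalized to unit length once, all similarities follow from a single matrix-vector product.</p>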
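<p>A minimal sketch of this computation is shown below, assuming the topic vectors are the columns of a truncated $V^T$ stored as a NumPy array. The function name <code>most_similar_docs</code> and the toy values are illustrative only.</p>

```python
import numpy as np

def most_similar_docs(Vt_k, i, top_n=5):
    """Return (index, cosine similarity) of the documents closest to document i."""
    # Normalize each document's topic vector (column) to unit length.
    norms = np.linalg.norm(Vt_k, axis=0, keepdims=True)
    V_unit = Vt_k / np.maximum(norms, 1e-12)
    # One matrix-vector product gives the cosine to every document.
    sims = V_unit[:, i] @ V_unit
    sims[i] = -np.inf              # exclude the document itself
    order = np.argsort(-sims)[:top_n]
    return [(int(j), float(sims[j])) for j in order]

# Toy topic vectors: 2 topics (rows) x 4 documents (columns).
Vt_k = np.array([
    [1.0, 2.0, 0.0, 0.1],
    [0.0, 0.1, 1.0, 3.0],
])

res = most_similar_docs(Vt_k, 0, top_n=3)
for doc, sim in res:
    print(doc, round(sim, 3))
```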
<h4 id="2.--Print-the-top-5-classes-most-similar-to-COM-308">2. Print the top 5 classes most similar to COM-308<a class="anchor-link" href="#2.--Print-the-top-5-classes-most-similar-to-COM-308">¶</a></h4>