<h4 id="Instructions">Instructions<a class="anchor-link" href="#Instructions">¶</a></h4><p><em>This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.</em></p>
<p><em>Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.</em></p>
<h4 id="1.--Print-k=10-topics-extracted-using-LDA-and-give-them-labels.">1. Print k=10 topics extracted using LDA and give them labels.<a class="anchor-link" href="#1.--Print-k=10-topics-extracted-using-LDA-and-give-them-labels.">¶</a></h4>
<p><strong>Topic 2:</strong> life science and engineering<br>
<strong>Topic 3:</strong> signals<br>
<strong>Topic 4:</strong> speech recognition<br>
<strong>Topic 5:</strong> material sciences<br>
<strong>Topic 6:</strong> architecture<br>
<strong>Topic 7:</strong> optical sciences<br>
<strong>Topic 8:</strong> bio projects<br>
<strong>Topic 9:</strong> bio-chemistry<br>
<strong>Topic 10:</strong> microtech</p>
<h4 id="2.--How-does-it-compare-with-LSI?">2. How does it compare with LSI?<a class="anchor-link" href="#2.--How-does-it-compare-with-LSI?">¶</a></h4><p>LSI:<br>
<strong>Topic 1:</strong> life sciences + computation<br>
<strong>Topic 7:</strong> seismical engineering and speech processing<br>
<strong>Topic 8:</strong> optical labs<br>
<strong>Topic 9:</strong> chemistry<br>
<strong>Topic 10:</strong> bio projects</p>
<p>We see that there are indeed some similarities in the topics, but they are ordered differently. Also, the words given by LDA were easier to label, while LSI sometimes gave quite broad/different words for one topic.</p>
<h2 id="Exercise-4.9:-Dirichlet-hyperparameters">Exercise 4.9: Dirichlet hyperparameters<a class="anchor-link" href="#Exercise-4.9:-Dirichlet-hyperparameters">¶</a></h2><p>Analyse the effects of α and β. You should start by reading the documentation of pyspark.mllib.clustering.LDA.</p>
<ol>
<li>Fix k=10 and β=1.01, and vary α. How does it impact the topics?</li>
<li>Fix k=10 and α=6, and vary β. How does it impact the topics?</li>
</ol>
<p>Hint: You can set the seed to produce more comparable output.</p>
<h4 id="How-does-it-impact-the-topics?">How does it impact the topics?<a class="anchor-link" href="#How-does-it-impact-the-topics?">¶</a></h4><p>$\alpha$ controls the uniformity/sparsity of the document-topic vectors, which we can see very well above. The smaller $\alpha$ is, the more the model tends to map each document to a small set of dominant topics. So for 0.01 the words describing each topic are very specific and very similar. With 0.1 the words already get a little more general. For 1.0 the words seem to get more general and more diverse; the topics are broader but also a little harder to label or to distinguish. The higher the value gets (10 and 100), the more general the words become and the more often they appear in several topics. For the value 100 we find the words model, project and some others in a lot of topics.</p>
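<p>This intuition can be sanity-checked with a small sketch in plain NumPy (not the lab's Spark pipeline, and no corpus needed): we sample document-topic vectors from a symmetric Dirichlet prior with concentration $\alpha$, which corresponds to the docConcentration parameter of pyspark.mllib.clustering.LDA. This shows only the prior's effect; the fitted model adds the likelihood, but the sparsity behaviour is the same.</p>

```python
import numpy as np

# Sketch (plain NumPy, not the lab's Spark code): sample document-topic
# vectors from a symmetric Dirichlet prior with concentration alpha
# (docConcentration in pyspark.mllib.clustering.LDA).
rng = np.random.default_rng(seed=0)
k = 10  # number of topics, as in the exercise

def mean_top_weight(alpha, n_docs=1000):
    """Average weight of each document's single most dominant topic."""
    theta = rng.dirichlet([alpha] * k, size=n_docs)
    return theta.max(axis=1).mean()

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    print(f"alpha={alpha:>6}: avg dominant-topic weight = {mean_top_weight(alpha):.3f}")
```

<p>For $\alpha = 0.01$ the dominant topic carries almost all of a document's weight (near 1), while for $\alpha = 100$ every topic gets close to the uniform weight $1/k$, matching the observation that topics blur together for large $\alpha$.</p>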
<h4 id="2.--Fix-k=10-and-α=6,-and-vary-β.-How-does-it-impact-the-topics?">2. Fix k=10 and α=6, and vary β. How does it impact the topics?<a class="anchor-link" href="#2.--Fix-k=10-and-α=6,-and-vary-β.-How-does-it-impact-the-topics?">¶</a></h4>
<h4 id="How-does-it-impact-the-topics?">How does it impact the topics?<a class="anchor-link" href="#How-does-it-impact-the-topics?">¶</a></h4><p>$\beta$ is a hyperparameter on the word distribution per topic.
The bigger $\beta$ gets, the more similar the words in each topic become. With $\beta = 100$ each topic is described by the same words, in the same order.</p>
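<p>A minimal sketch of this smoothing effect, using hypothetical word counts rather than the lab data: in LDA's posterior, each topic's word distribution is roughly proportional to (topic-word counts + $\beta$), so a large $\beta$ (topicConcentration in pyspark.mllib.clustering.LDA) pulls all topics toward the same near-uniform distribution.</p>

```python
import numpy as np

# Sketch with hypothetical counts (not the lab data): each topic's word
# distribution is smoothed as (counts + beta), normalized per topic.
# Large beta makes all topic-word distributions nearly identical.
rng = np.random.default_rng(seed=1)
n_topics, vocab_size = 10, 50
counts = rng.poisson(lam=3.0, size=(n_topics, vocab_size)).astype(float)

def avg_topic_distance(beta):
    """Mean pairwise L1 distance between smoothed topic-word distributions."""
    smoothed = counts + beta
    probs = smoothed / smoothed.sum(axis=1, keepdims=True)
    dists = [np.abs(probs[i] - probs[j]).sum()
             for i in range(n_topics) for j in range(i + 1, n_topics)]
    return float(np.mean(dists))

for beta in [1.01, 10.0, 100.0, 1000.0]:
    print(f"beta={beta:>7}: avg pairwise topic distance = {avg_topic_distance(beta):.3f}")
```

<p>As $\beta$ grows the distance between topics shrinks toward zero. In the real fitted model the effect is even more visible: word assignments become fuzzy, every topic accumulates the globally frequent words, and all topics end up listing the same top words.</p>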
<h4 id="1.-Find-the-combination-of-k,-α-and-β-that-gives-the-most-interpretable-topics.">1. Find the combination of k, α and β that gives the most interpretable topics.<a class="anchor-link" href="#1.-Find-the-combination-of-k,-α-and-β-that-gives-the-most-interpretable-topics.">¶</a></h4><p>We tried some rather small values for both hyperparameters. The results are not too different, so $\alpha = 0.01$ and $\beta = 0.1$ seem to give well interpretable topics.</p>
<h4 id="2.--Explain-why-you-chose-these-values.">2. Explain why you chose these values.<a class="anchor-link" href="#2.--Explain-why-you-chose-these-values.">¶</a></h4><p>As we want well interpretable topics, the goal is to get very specific terms describing the topics rather than overly general ones. That way we can distinguish the topics better. That is why we want $\alpha$ below 1 and $\beta$ also rather low.</p>
<h4 id="3.--Report-the-values-of-the-hyperparameters-that-you-used-and-your-labels-for-the-topic.">3. Report the values of the hyperparameters that you used and your labels for the topic.<a class="anchor-link" href="#3.--Report-the-values-of-the-hyperparameters-that-you-used-and-your-labels-for-the-topic.">¶</a></h4><p>We chose $\alpha=0.01; \beta=0.1; k=7$.<br>
We chose seven topics because there are seven sections at EPFL. If the descriptions work well we should be able to label the topics. As a matter of fact, the labelling is easily possible:</p>
<p>Architecture, Civil and Environmental Engineering ENAC - topic 6<br>
Basic Sciences SB - topic 5<br>
Engineering STI - topic 4<br>
Computer and Communication Sciences IC - topic 2<br>
<h2 id="Exercise-4.11:-Wikipedia-structure">Exercise 4.11: Wikipedia structure<a class="anchor-link" href="#Exercise-4.11:-Wikipedia-structure">¶</a></h2><p>Extract the structure in terms of topics from the wikipedia-for-school dataset. Use your intuition about how many topics might be covered by the articles and how they are distributed.</p>
<ol>
<li>Report the values for k, α and β that you chose a priori and why you picked them.</li>
<li>Are you convinced by the results? Give labels to the topics if possible.</li>
</ol>
<h4 id="1.--Report-the-values-for-k,-α-and-β-that-you-chose-a-priori-and-why-you-picked-them.">1. Report the values for k, α and β that you chose a priori and why you picked them.<a class="anchor-link" href="#1.--Report-the-values-for-k,-α-and-β-that-you-chose-a-priori-and-why-you-picked-them.">¶</a></h4><p>For α and β we chose the same values as above, since we saw that they worked well. For k we pick 12, because according to Wikipedia itself there are 12 main categories of articles: <a href="https://en.wikipedia.org/wiki/Category:Main_topic_classifications">https://en.wikipedia.org/wiki/Category:Main_topic_classifications</a></p>
<h4 id="2.--Are-you-convinced-by-the-results?-Give-labels-to-the-topics-if-possible.">2. Are you convinced by the results? Give labels to the topics if possible.<a class="anchor-link" href="#2.--Are-you-convinced-by-the-results?-Give-labels-to-the-topics-if-possible.">¶</a></h4><p>The labels are ok but not totally convincing. This is probably because the chosen categorization was made by humans and does not really separate all main topics. E.g. reference works can be about any possible topic, and Society, People and Culture are closely intertwined.
So overall, some topics can be labelled easily while others are not quite clear; maybe more topics would have helped.</p>