<p>Fermenting Gradients: a blog about machine learning and fermented foods, by Roberto Silveira (rsilveira79@gmail.com).</p>

<h1 id="covid-19-under-reporting-estimation">COVID 19 Under-reporting estimation</h1>
<p><em>2020-04-19</em></p>
<h1 id="using-python-pandas--numpy--scipy">Using python (Pandas + Numpy + Scipy)</h1>
<p>In these pandemic times, data scientists and machine learning engineers are stepping in and building models to help policymakers make decisions in very uncertain moments. A great example is a friend of mine, Christian Perone, who is spending a lot of energy elucidating what is happening, using Bayesian inference as the main tool. Check out his dedicated website, which includes several different analyses: <a href="https://perone.github.io/covid19analysis/">Christian Perone - COVID-19 Analysis Repository</a></p>
<p>Christian pointed me to an article by <a href="mailto:timothy.russell@lshtm.ac.uk">Timothy W Russell</a> on how to estimate COVID-19 under-reporting using a delay-adjusted case fatality ratio. More details on Timothy’s paper can be found here <a href="#under_report">1</a>.</p>
<p>Timothy has already provided R code (<a href="https://github.com/thimotei/CFR_calculation">CFR calculation</a>) that can be used.
What I did here was translate that R code to Python, so that Pythonistas can use Pandas/Numpy/Scipy to perform the same calculations, or replace the dataframe with data from their own community.</p>
<h2 id="method-for-estimating-under-reporting">Method for estimating under-reporting</h2>
<p>I won’t go into detail about the method itself; if you need more information, please refer to Timothy’s paper <a href="#under_report">1</a>. Basically, dividing <em>deaths-to-date</em> by <em>cases-to-date</em> leads to a biased estimate of the case fatality ratio (CFR), because this simple method accounts neither for the delay from confirmation of a case to death nor for the under-reporting of cases.
This method adjusts the CFR by using the distribution of the delay from <em>hospitalisation-to-death</em>, assuming that this delay is the same as <em>confirmation-to-death</em>. The distribution uses a <em>Lognormal</em> fit, with a mean delay of 13 days and a standard deviation of 12.7 days.</p>
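<p>As a side note (my addition, not part of the original code): the lognormal is parameterised by the log-scale parameters \(\mu\) and \(\sigma\), which can be recovered from the stated mean (13 days) and standard deviation (12.7 days) by standard moment matching:</p>

```python
import numpy as np

# My addition: recover the lognormal log-scale parameters (mu, sigma)
# from the delay distribution's mean (13 days) and standard deviation
# (12.7 days) via standard moment matching.
mean_delay, sd_delay = 13.0, 12.7

sigma = np.sqrt(np.log(1.0 + (sd_delay / mean_delay) ** 2))
mu = np.log(mean_delay) - sigma ** 2 / 2.0

# Sanity check: the lognormal with these parameters has the stated moments
recovered_mean = np.exp(mu + sigma ** 2 / 2.0)
recovered_sd = recovered_mean * np.sqrt(np.exp(sigma ** 2) - 1.0)
```

<p>These \(\mu\) and \(\sigma\) are what the lognormal CDF helper below expects as its <code class="language-plaintext highlighter-rouge">mu</code> and <code class="language-plaintext highlighter-rouge">sigma</code> arguments.</p>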
<p>The under-estimation can be calculated as:
\(u_{t}=\frac{\sum_{j=0}^{t}c_{t-j}f_{j}}{c_{t}}\)<br />
where:<br />
\(u_{t}\) = underestimation of the proportion of cases with known outcomes<br />
\(c_{t}\) = daily case incidence at time t<br />
\(f_{t}\) = proportion of cases with delay of t between confirmation and death</p>
<p>For the lognormal fit, I used the scipy function <code class="language-plaintext highlighter-rouge">lognorm</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">plnorm</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">):</span>
<span class="n">shape</span> <span class="o">=</span> <span class="n">sigma</span>
<span class="n">loc</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">scale</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">mu</span><span class="p">)</span>
<span class="k">return</span> <span class="n">lognorm</span><span class="p">.</span><span class="n">cdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">shape</span><span class="p">,</span> <span class="n">loc</span><span class="p">,</span> <span class="n">scale</span><span class="p">)</span>
</code></pre></div></div>
<p>This will be used in the <code class="language-plaintext highlighter-rouge">hospitalisation_to_death_truncated</code> function, which is the delay function used in the adjustment:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">hospitalisation_to_death_truncated</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">mu</span><span class="p">,</span><span class="n">sigma</span><span class="p">):</span>
<span class="k">return</span> <span class="n">plnorm</span><span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">)</span> <span class="o">-</span> <span class="n">plnorm</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">)</span>
</code></pre></div></div>
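<p>As a quick sanity check (my addition), the values returned by <code class="language-plaintext highlighter-rouge">hospitalisation_to_death_truncated</code> form a discretised probability mass function, so they should sum to roughly 1 over a long enough horizon. A self-contained sketch, with illustrative (not fitted) parameters:</p>

```python
import numpy as np
from scipy.stats import lognorm

def plnorm(x, mu, sigma):
    # Lognormal CDF parameterised like R's plnorm (meanlog, sdlog)
    return lognorm.cdf(x, sigma, 0, np.exp(mu))

def hospitalisation_to_death_truncated(x, mu, sigma):
    # Probability of the delay falling in the one-day window [x, x + 1)
    return plnorm(x + 1, mu, sigma) - plnorm(x, mu, sigma)

# Illustrative (not fitted) parameters
mu, sigma = 2.0, 0.8
days = np.arange(0, 1000)
pmf = hospitalisation_to_death_truncated(days, mu, sigma)
total = pmf.sum()  # close to 1: the daily windows tile [0, 1000)
```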
<p>The cCFR (corrected Case Fatality Ratio) is calculated in this loop:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cumulative_known_t</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">ii</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)):</span>
<span class="n">known_i</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">jj</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">ii</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span>
<span class="n">known_jj</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'new_cases'</span><span class="p">].</span><span class="n">loc</span><span class="p">[</span><span class="n">ii</span><span class="o">-</span><span class="n">jj</span><span class="p">]</span><span class="o">*</span><span class="n">delay_func</span><span class="p">(</span><span class="n">jj</span><span class="p">)</span>
<span class="n">known_i</span> <span class="o">=</span> <span class="n">known_i</span> <span class="o">+</span> <span class="n">known_jj</span>
<span class="n">cumulative_known_t</span> <span class="o">=</span> <span class="n">cumulative_known_t</span> <span class="o">+</span> <span class="n">known_i</span>
<span class="n">cum_known_t</span> <span class="o">=</span> <span class="nb">round</span><span class="p">(</span><span class="n">cumulative_known_t</span><span class="p">)</span>
<span class="n">nCFR</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'new_deaths'</span><span class="p">].</span><span class="nb">sum</span><span class="p">()</span><span class="o">/</span><span class="n">df</span><span class="p">[</span><span class="s">'new_cases'</span><span class="p">].</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">cCFR</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'new_deaths'</span><span class="p">].</span><span class="nb">sum</span><span class="p">()</span><span class="o">/</span><span class="n">cum_known_t</span>
</code></pre></div></div>
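<p>The double loop above is quadratic in the number of days. For larger dataframes, the same cumulative quantity can be computed with a convolution; the following is my own vectorised sketch (using <code class="language-plaintext highlighter-rouge">numpy.convolve</code>, not part of the original R code), checked against the loop on toy data:</p>

```python
import numpy as np

def cumulative_known(new_cases, delay_pmf):
    # known_i on day i is sum_j new_cases[i - j] * delay_pmf[j],
    # i.e. a truncated convolution; we then sum these over all days.
    n = len(new_cases)
    daily_known = np.convolve(new_cases, delay_pmf)[:n]
    return daily_known.sum()

# Hypothetical toy data, for illustration only
cases = np.array([10.0, 20.0, 30.0, 25.0])
f = np.array([0.5, 0.3, 0.2, 0.0])

# Reference value from the original double loop
ref = 0.0
for ii in range(len(cases)):
    for jj in range(ii + 1):
        ref += cases[ii - jj] * f[jj]
```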
<p>where <code class="language-plaintext highlighter-rouge">delay_func</code> is the <code class="language-plaintext highlighter-rouge">hospitalisation_to_death_truncated</code> function specified above (with <strong>low</strong>, <strong>mid</strong> and <strong>high</strong> means and standard deviations).</p>
<p>The full code can be found in this Jupyter Notebook:</p>
<p><a href="https://github.com/rsilveira79/CFR_calculation_python/blob/master/notebooks/1.initial_assessment.ipynb" class="btn btn--success">COVID 19 - Under-reporting estimation</a></p>
<p>Also, I configured a GitHub Action to execute this code every 12 hours and write the resulting <strong>.CSV</strong> file to the <strong>output</strong> folder of this <a href="https://github.com/rsilveira79/CFR_calculation_python">repo</a>.</p>
<h2 id="references">References</h2>
<ol>
<li><strong>Using a delay-adjusted case fatality ratio to estimate under-reporting</strong> <a name="under_report"><a href="https://cmmid.github.io/topics/covid19/global_cfr_estimates.html">Link</a></a><br />Timothy W Russell*, Joel Hellewell1, Sam Abbott1, Nick Golding, Hamish Gibbs, Christopher I Jarvis, Kevin van Zandvoort, CMMID nCov working group, Stefan Flasche, Rosalind Eggo, W John Edmunds & Adam J Kucharski, 2020</li>
</ol>

<h1 id="sampling-diverse-neurips-papers-using-determinantal-point-process-dpp">Sampling diverse NeurIPS papers using Determinantal Point Process (DPP)</h1>
<p><em>2019-09-24</em></p>
<p>It is NeurIPS time! This is the time of the year when NeurIPS (or NIPS) papers come out, abstracts are approved, and developers and researchers go crazy with the breadth and depth of papers available to read (and hopefully to reproduce/implement). Everyone has a different approach to skimming papers and discovering new material.
One approach is to pick a specific area of interest (natural language processing, computer vision, bayesian inference, optimization, etc.) and go deep into that subject. Or one can select some papers at random to explore, and then exploit papers in a preferred domain.
We will present a different approach here, where we select papers by <strong>diversity</strong>, meaning that we want papers that tend not to overlap and are spread evenly across the space.
To do so, the idea here is to use determinantal point processes (DPPs) as a way to capture more diverse samples in a given sampling space. For a more detailed reading on DPPs, please consider reading the awesome paper <em>Determinantal point processes for machine learning</em> <sup> <a href="#dpp_ml">1</a></sup>. A point process \(\mathcal{P}\) on a discrete set \(\mathcal{Y}\) is a probability measure on \(2^\mathcal{Y}\), all the possible subsets of \(\mathcal{Y}\). In a <strong>determinantal</strong> point process, a random subset <strong>Y</strong> will have, for every subset \(\mathcal{A}\) contained in \(\mathcal{Y}\) (\(\mathcal{A} \subseteq \mathcal{Y}\)) :</p>
\[\mathcal{P}(\mathcal{A} \subseteq \mathcal{Y}) = det(K_{A})\]
<p>\(K_{A}\equiv[K_{i,j}]_{i,j\in A}\) in the equation above represents the marginal kernel used to compute the probability that points \(i,j\) of any subset \(\mathcal{A}\) are included in <strong>Y</strong>. By applying the determinant rule, we can expand \(\mathcal{P}\) as:</p>
\[\mathcal{P}(i \in Y) = K_{ii} \\
\mathcal{P}(i,j \in Y) = K_{ii}K_{jj}-K_{ij}K_{ji} \\
\mathcal{P}(i,j \in Y) = \mathcal{P}(i \in Y) \mathcal{P}(j \in Y)-K_{ij}^2\]
<p>The last term in the equation above (\(K_{ij}^2\)) determines the (anti-)correlations between pairs of elements, meaning that pairs with large values of \(K_{ij}\) (highly correlated points) will tend <strong>not</strong> to co-occur. This is the part of the DPP formulation that ensures sampling for diversity.</p>
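<p>To make the anti-correlation concrete, here is a small numeric example of my own, with an arbitrary (but valid) 2×2 marginal kernel: the joint inclusion probability falls below the product of the marginals exactly because of the \(K_{ij}^2\) term:</p>

```python
import numpy as np

# An arbitrary valid marginal kernel for two items
# (symmetric, with eigenvalues between 0 and 1)
K = np.array([[0.6, 0.3],
              [0.3, 0.5]])

p_i = K[0, 0]             # P(i in Y)
p_j = K[1, 1]             # P(j in Y)
p_ij = np.linalg.det(K)   # P(i, j in Y) = K_ii * K_jj - K_ij**2
# p_ij < p_i * p_j: the off-diagonal term penalises co-occurrence
```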
<p>In this experiment, we would like to sample a specific number of papers <em>k</em>, so we will use <strong>k-DPP</strong> <sup><a href="#kdpp">2</a></sup> instead of a plain <strong>DPP</strong>. A <strong>k-DPP</strong> conditions a <strong>DPP</strong> on cardinality <em>k</em>, and ends up being a mixture of elementary <strong>DPPs</strong>, giving nonzero weight \(\lambda_{n}\) only to elements of dimension <em>k</em>. We experimented with the really nice Python library \(DPPy\) by Guillaume Gautier (more details here <sup><a href="#dppy_paper">3</a></sup> and here <sup><a href="#dppy_git">4</a></sup>) to run the code below.</p>
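<p>Before going to the full pipeline, a toy illustration (my own, independent of DPPy) of why determinants reward diversity: the determinant of a likelihood kernel \(L = XX^{T}\) built from near-duplicate vectors is close to zero, while well-spread vectors give a much larger determinant:</p>

```python
import numpy as np

def diversity_score(X):
    # Determinant of the likelihood (Gram) kernel L = X X^T;
    # larger values mean the rows of X are more mutually diverse
    return np.linalg.det(X @ X.T)

# Three nearly parallel vectors vs. three orthogonal ones
clustered = np.array([[1.00, 0.00, 0.00],
                      [0.99, 0.14, 0.00],
                      [0.99, 0.00, 0.14]])
spread = np.eye(3)
```

<p>A k-DPP assigns probability proportional to this determinant to each size-<em>k</em> subset, so near-duplicate subsets are almost never sampled.</p>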
<p>First things, all the code on this post can be found in this Colab notebook:<br />
<a href="https://colab.research.google.com/drive/1TWdpRN7D7UEsALab3Ej5EMqOiiG35Lol" class="btn btn--success">2018 NeurIPS diverse papers w/ DPPy</a></p>
<p>For this experiment, we used NeurIPS 2018 papers from this <a href="http://papers.nips.cc/book/advances-in-neural-information-processing-systems-31-2018">link</a>, as the 2019 papers are not yet available on that page (which makes it easier to get abstracts using tools such as <code class="language-plaintext highlighter-rouge">BeautifulSoup</code>). Once the 2019 papers are available, I will add the Colab notebook here.</p>
<h2 id="getting-papers-title-and-abstract-with-beautifulsoup">Getting papers Title and Abstract with BeautifulSoup</h2>
<p>I used the awesome Python lib BeautifulSoup to extract the papers’ titles and abstracts from the official NeurIPS <a href="http://papers.nips.cc/book/advances-in-neural-information-processing-systems-31-2018">page</a>. The first job is to get all paper title URLs, and then to visit each paper link and extract the abstract by searching for the <code class="language-plaintext highlighter-rouge">subtitle</code> and <code class="language-plaintext highlighter-rouge">abstract</code> classes. This is the function to extract the title and abstract from a given URL:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_title_abstract</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">text</span><span class="p">,</span> <span class="s">"html.parser"</span><span class="p">)</span>
<span class="n">title</span> <span class="o">=</span> <span class="n">soup</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">'class'</span> <span class="p">:</span> <span class="s">'subtitle'</span><span class="p">})</span>
<span class="n">abstract</span> <span class="o">=</span> <span class="n">soup</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">'class'</span> <span class="p">:</span> <span class="s">'abstract'</span><span class="p">})</span>
<span class="k">return</span> <span class="n">title</span><span class="p">.</span><span class="n">get_text</span><span class="p">(),</span> <span class="n">abstract</span><span class="p">.</span><span class="n">get_text</span><span class="p">()</span>
</code></pre></div></div>
<p>And here we extract the title and abstract of every 2018 paper and store the information in a Pandas DataFrame:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">links_url</span> <span class="o">=</span> <span class="s">'http://papers.nips.cc/book/advances-in-neural-information-processing-systems-31-2018'</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">links_url</span><span class="p">)</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">text</span><span class="p">,</span> <span class="s">"html.parser"</span><span class="p">)</span>
<span class="n">main_wrapper</span> <span class="o">=</span> <span class="n">soup</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">'class'</span> <span class="p">:</span> <span class="s">'main wrapper clearfix'</span><span class="p">})</span>
<span class="n">paper_urls</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">main_wrapper</span><span class="p">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'a'</span><span class="p">,</span> <span class="n">href</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
<span class="k">if</span> <span class="s">'/paper/'</span> <span class="ow">in</span> <span class="nb">str</span><span class="p">(</span><span class="n">a</span><span class="p">):</span>
<span class="n">paper_urls</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"http://papers.nips.cc"</span> <span class="o">+</span> <span class="n">a</span><span class="p">[</span><span class="s">'href'</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Total papers in 2018: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">paper_urls</span><span class="p">)))</span>
<span class="n">nips_2018</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">tqdm_notebook</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">paper_urls</span><span class="p">))):</span>
<span class="n">t</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">get_title_abstract</span><span class="p">(</span><span class="n">paper_urls</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
<span class="n">nips_2018</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="s">"title"</span><span class="p">:</span><span class="n">t</span><span class="p">,</span> <span class="s">"abstract"</span><span class="p">:</span><span class="n">a</span><span class="p">}</span>
<span class="n">nips_2018_dataframe</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">.</span><span class="n">from_dict</span><span class="p">(</span><span class="n">nips_2018</span><span class="p">,</span> <span class="n">orient</span><span class="o">=</span><span class="s">'index'</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="embedding-text-with-bert">Embedding text with BERT</h2>
<p>Now that we have all the necessary information in a dataframe, we will use sentence embeddings, more specifically BERT embeddings <sup><a href="#bert">5</a></sup>, to convert text into a vector in a latent space. To make things easier, I used the nice lib Flair from Zalando Research <sup><a href="#flair">6</a></sup>, averaging the token embeddings (output dimension = <strong>3072</strong>). As input for the BERT embedder, the title and abstract were concatenated and had stopwords removed (no stemming or lemmatization was applied).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">flair.embeddings</span> <span class="kn">import</span> <span class="n">BertEmbeddings</span>
<span class="n">bert_embedding</span> <span class="o">=</span> <span class="n">BertEmbeddings</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">get_bert_embedding</span><span class="p">(</span><span class="n">sent</span><span class="p">):</span>
<span class="n">sentence</span> <span class="o">=</span> <span class="n">Sentence</span><span class="p">(</span><span class="n">sent</span><span class="p">)</span>
<span class="n">bert_embedding</span><span class="p">.</span><span class="n">embed</span><span class="p">(</span><span class="n">sentence</span><span class="p">)</span>
<span class="n">all_tensors</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">bert_embedding</span><span class="p">.</span><span class="n">embedding_length</span><span class="p">)</span>
<span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">sentence</span><span class="p">:</span>
<span class="n">all_tensors</span><span class="o">+=</span><span class="n">token</span><span class="p">.</span><span class="n">embedding</span>
<span class="k">return</span> <span class="n">all_tensors</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">sentence</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="paper-diversity-with-dppy">Paper diversity with DPPy</h2>
<p>Now to the final and most exciting part: how to select diverse papers. To have a baseline to compare the DPP-sampled papers against, I selected the closest (nearest) papers by computing the cosine similarity between a given random paper and the whole set of 1009 papers from the 2018 NeurIPS conference. In order to plot these papers in 2D, I used TSNE from Scikit-Learn <sup><a href="#sklearn">7</a></sup> (with perplexity = <strong>5</strong>).</p>
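<p>The nearest baseline can be sketched as follows (a minimal version of my own in plain NumPy; the notebook uses the full 3072-dimensional BERT embeddings):</p>

```python
import numpy as np

def nearest_indices(embeddings, query_idx, k):
    # Rank all papers by cosine similarity to the query paper's
    # embedding and return the k most similar ones (query excluded)
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = E @ E[query_idx]
    sims[query_idx] = -np.inf  # never return the query itself
    return np.argsort(sims)[::-1][:k]

# Toy 2-d "embeddings": papers 0 and 1 are nearly parallel, paper 2 is not
emb = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0]])
```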
<p>The distribution of the closest (nearest) papers given a randomly selected paper is shown in the figure below. As we can see, most of the papers cluster together in the top right corner of the picture, confirming that they are similar when the embeddings are projected onto the 2D plane:</p>
<p><img src="/fermenting_gradients/assets/images/nearest_visualization.png" /></p>
<p>As for the randomly sampled papers, we can see in the figure below that they seem to be clustered in some parts of the plane (bottom right) rather than evenly distributed:</p>
<p><img src="/fermenting_gradients/assets/images/random_visualization.png" /></p>
<p>Finally, the figure below shows papers sampled using <strong>k-DPP</strong> from the DPPy library, with <em>k</em> = <strong>10</strong> and a likelihood kernel. As we can see, the distribution seems to be more diverse and more evenly spread in space than with the previous methods (nearest, random):</p>
<p><img src="/fermenting_gradients/assets/images/diverse_visualization.png" /></p>
<p>To compare these three sets more quantitatively, I measured the average Jaccard similarity, cosine similarity and Euclidean distance of the <code class="language-plaintext highlighter-rouge">nearest</code>, <code class="language-plaintext highlighter-rouge">random</code> and <code class="language-plaintext highlighter-rouge">diverse</code> sets. It is interesting to note that, as the <code class="language-plaintext highlighter-rouge">diverse</code> set is more evenly distributed in space, its Jaccard and cosine similarities are lower, and its average Euclidean distance is higher than those of the <code class="language-plaintext highlighter-rouge">nearest</code> and <code class="language-plaintext highlighter-rouge">random</code> sets.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">JACCARD</span> <span class="n">SIMILARITY</span> <span class="o">---</span>
<span class="n">RANDOM</span><span class="p">:</span> <span class="mf">0.05260996952409507</span>
<span class="n">DIVERSE</span><span class="p">:</span> <span class="mf">0.04059503829222504</span>
<span class="n">NEAREST</span><span class="p">:</span> <span class="mf">0.04299341079777443</span>
<span class="n">COSINE</span> <span class="n">SIMILARITY</span> <span class="o">---</span>
<span class="n">RANDOM</span><span class="p">:</span> <span class="mf">0.9287719594107734</span>
<span class="n">DIVERSE</span><span class="p">:</span> <span class="mf">0.9257738563749526</span>
<span class="n">NEAREST</span><span class="p">:</span> <span class="mf">0.9476157267888387</span>
<span class="n">EUCLIDEAN</span> <span class="n">DISTANCES</span> <span class="o">---</span>
<span class="n">RANDOM</span><span class="p">:</span> <span class="mf">9.385317516326904</span>
<span class="n">DIVERSE</span><span class="p">:</span> <span class="mf">10.03541965484619</span>
<span class="n">NEAREST</span><span class="p">:</span> <span class="mf">8.910124270121257</span>
</code></pre></div></div>
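<p>For reference, the average pairwise cosine similarity and Euclidean distance of a set can be computed as in this sketch of my own (the Jaccard similarity in the notebook is computed on the papers’ token sets, which is omitted here):</p>

```python
import numpy as np
from scipy.spatial.distance import pdist

def avg_pairwise(X):
    # Averages over all unordered pairs of rows of X;
    # pdist's cosine metric is a distance, so similarity = 1 - distance
    return {
        "cosine_similarity": 1.0 - pdist(X, metric="cosine").mean(),
        "euclidean_distance": pdist(X, metric="euclidean").mean(),
    }

# Toy check: orthonormal vectors have zero average cosine similarity
metrics = avg_pairwise(np.eye(3))
```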
<p>Hope you enjoyed this post and find some nice applications for sampling with DPPs!</p>
<h2 id="references">References</h2>
<ol>
<li><strong>Determinantal point processes for machine learning</strong> <a name="dpp_ml"><a href="https://arxiv.org/abs/1207.6083">PDF</a></a><br />Alex Kulesza and Ben Taskar, 2012</li>
<li><strong>k-DPPs: Fixed-Size Determinantal Point Processes</strong> <a name="kdpp"><a href="https://www.alexkulesza.com/pubs/kdpps_icml11.pdf">PDF</a></a><br />Alex Kulesza and Ben Taskar, 2011</li>
<li><strong>DPPy: Sampling DPPs with Python</strong> <a name="dppy_paper"><a href="https://arxiv.org/abs/1809.07258">PDF</a></a><br />Guillaume Gautier, Guillermo Polito, Rémi Bardenet and Michal Valko, 2018</li>
<li><strong>Python library for sampling Determinantal Point Processes</strong> <a name="dppy_git"><a href="https://github.com/guilgautier/DPPy">Link</a></a><br />Guillaume Gautier, Guillermo Polito, Rémi Bardenet and Michal Valko, 2018</li>
<li><strong>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</strong> <a name="bert"><a href="https://arxiv.org/abs/1810.04805">PDF</a></a><br />Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova, 2018</li>
<li><strong>Flair - A very simple framework for state-of-the-art Natural Language Processing (NLP)</strong> <a name="flair"><a href="https://github.com/zalandoresearch/flair">Link</a></a><br />Zalando Research</li>
<li><strong>scikit-learn - Machine Learning in Python</strong> <a name="sklearn"><a href="https://scikit-learn.org/stable/">Link</a></a></li>
</ol>

<h1 id="text-classification-with-roberta">Text classification with RoBERTa</h1>
<p><em>2019-08-19</em></p>
<h1 id="fine-tuning-pytorch-transformers-for-sequenceclassificatio">Fine-tuning pytorch-transformers for SequenceClassification</h1>
<p>As mentioned in an earlier <a href="https://rsilveira79.github.io/fermenting_gradients/machine_learning/nlp/pytorch/pytorch-transformer-squad/">post</a>, I’m a big fan of the work that <a href="http://huggingface.co">Hugging Face</a> is doing to make the latest models available to the community.
Very recently, they made available Facebook’s RoBERTa: <em>A Robustly Optimized BERT Pretraining Approach</em><sup> <a href="#roberta">1</a></sup>. The Facebook team proposed several improvements on top of BERT <sup> <a href="#bert">2</a></sup>, with the main assumption that the BERT model was <em>“significantly undertrained”</em>. The modifications over BERT include:</p>
<ol>
<li>training the model longer, with bigger batches;</li>
<li>removing the next sentence prediction objective;</li>
<li>training on longer sequences;</li>
<li>dynamically changing the masking pattern applied to the training data;</li>
</ol>
<p>More details can be found in the <a href="#roberta">paper</a>; here we will focus on a practical application of the RoBERTa model using the <code class="language-plaintext highlighter-rouge">pytorch-transformers</code> library: text classification.
For this practical application, we are going to use the SNIPS NLU (Natural Language Understanding) dataset <sup> <a href="#snips">3</a></sup>.</p>
<h2 id="nlu-dataset">NLU Dataset</h2>
<p>The NLU dataset is composed of several intents; for this post we are going to use the <code class="language-plaintext highlighter-rouge">2017-06-custom-intent-engines</code> dataset, which is composed of 7 classes:</p>
<ul>
<li><strong>SearchCreativeWork</strong> (e.g. Find me the I, Robot television show);</li>
<li><strong>GetWeather</strong> (e.g. Is it windy in Boston, MA right now?);</li>
<li><strong>BookRestaurant</strong> (e.g. I want to book a highly rated restaurant for me and my boyfriend tomorrow night);</li>
<li><strong>PlayMusic</strong> (e.g. Play the last track from Beyoncé off Spotify);</li>
<li><strong>AddToPlaylist</strong> (e.g. Add Diamonds to my roadtrip playlist);</li>
<li><strong>RateBook</strong> (e.g. Give 6 stars to Of Mice and Men);</li>
<li><strong>SearchScreeningEvent</strong> (e.g. Check the showtimes for Wonder Woman in Paris);</li>
</ul>
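<p>The <code class="language-plaintext highlighter-rouge">label_to_ix</code> mapping used further down when setting <code class="language-plaintext highlighter-rouge">num_labels</code> can be built directly from these intent names (a sketch of my own; the notebook derives it from the dataset itself):</p>

```python
# Map each of the 7 SNIPS intents to an integer id
intents = [
    "SearchCreativeWork", "GetWeather", "BookRestaurant", "PlayMusic",
    "AddToPlaylist", "RateBook", "SearchScreeningEvent",
]
label_to_ix = {label: ix for ix, label in enumerate(intents)}
num_labels = len(label_to_ix)  # 7
```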
<h2 id="pytorch-transformers-robertaforsequenceclassification">pytorch-transformers <code class="language-plaintext highlighter-rouge">RobertaForSequenceClassification</code></h2>
<p>As described in an earlier <a href="https://rsilveira79.github.io/fermenting_gradients/machine_learning/nlp/pytorch/pytorch-transformer-squad/">post</a>, <code class="language-plaintext highlighter-rouge">pytorch-transformers</code> bases its API on a few main classes, and here it is no different:</p>
<ul>
<li>RobertaConfig</li>
<li>RobertaTokenizer</li>
<li>RobertaModel</li>
</ul>
<p>All the code on this post can be found in this Colab notebook:<br />
<a href="https://colab.research.google.com/drive/1xg4UMQmXjDik3v9w-dAsk4kq7dXX_0Fm" class="btn btn--success">Text Classification with RoBERTa</a></p>
<p>First things first, we need to import RoBERTa from <code class="language-plaintext highlighter-rouge">pytorch-transformers</code>, making sure that we are using the latest release <strong><font color="red">1.1.0</font></strong>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pytorch_transformers</span> <span class="kn">import</span> <span class="n">RobertaModel</span><span class="p">,</span> <span class="n">RobertaTokenizer</span>
<span class="kn">from</span> <span class="nn">pytorch_transformers</span> <span class="kn">import</span> <span class="n">RobertaForSequenceClassification</span><span class="p">,</span> <span class="n">RobertaConfig</span>
<span class="n">config</span> <span class="o">=</span> <span class="n">RobertaConfig</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">'roberta-base'</span><span class="p">)</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">RobertaTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">'roberta-base'</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">RobertaForSequenceClassification</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
</code></pre></div></div>
<p>As the NLU dataset has 7 classes (labels), we need to set this in the RoBERTa configuration:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">config</span><span class="p">.</span><span class="n">num_labels</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">label_to_ix</span><span class="p">.</span><span class="n">values</span><span class="p">()))</span>
</code></pre></div></div>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"attention_probs_dropout_prob"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.1</span><span class="p">,</span><span class="w">
</span><span class="nl">"finetuning_task"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
</span><span class="nl">"hidden_act"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gelu"</span><span class="p">,</span><span class="w">
</span><span class="nl">"hidden_dropout_prob"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.1</span><span class="p">,</span><span class="w">
</span><span class="nl">"hidden_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">768</span><span class="p">,</span><span class="w">
</span><span class="nl">"initializer_range"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.02</span><span class="p">,</span><span class="w">
</span><span class="nl">"intermediate_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">3072</span><span class="p">,</span><span class="w">
</span><span class="nl">"layer_norm_eps"</span><span class="p">:</span><span class="w"> </span><span class="mi">1e-12</span><span class="p">,</span><span class="w">
</span><span class="nl">"max_position_embeddings"</span><span class="p">:</span><span class="w"> </span><span class="mi">514</span><span class="p">,</span><span class="w">
</span><span class="nl">"num_attention_heads"</span><span class="p">:</span><span class="w"> </span><span class="mi">12</span><span class="p">,</span><span class="w">
</span><span class="nl">"num_hidden_layers"</span><span class="p">:</span><span class="w"> </span><span class="mi">12</span><span class="p">,</span><span class="w">
</span><span class="nl">"num_labels"</span><span class="p">:</span><span class="w"> </span><span class="mi">7</span><span class="p">,</span><span class="w">
</span><span class="nl">"output_attentions"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"output_hidden_states"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"torchscript"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"type_vocab_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
</span><span class="nl">"vocab_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">50265</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>In this notebook, I used the nice Colab GPU feature, so all the boilerplate code with <code class="language-plaintext highlighter-rouge">.cuda()</code> is there. Make sure you have the correct device specified [<code class="language-plaintext highlighter-rouge">cpu</code>, <code class="language-plaintext highlighter-rouge">cuda</code>] when running/training the classifier.</p>
<p>I fine-tuned the classifier for <strong>3</strong> epochs, using <code class="language-plaintext highlighter-rouge">learning_rate</code>= <strong>1e-05</strong>, with the <code class="language-plaintext highlighter-rouge">Adam</code> optimizer and <code class="language-plaintext highlighter-rouge">nn.CrossEntropyLoss()</code>. Depending on the dataset you are dealing with, these parameters may need to be changed. After the <strong>3</strong> epochs, the train accuracy was <strong>~ 98%</strong>, which is reasonable for such a small dataset (and probably indicates a bit of overfitting as well).</p>
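<p>The fine-tuning loop itself is plain PyTorch. Here is a minimal sketch with the hyperparameters above (3 epochs, <code class="language-plaintext highlighter-rouge">Adam</code> at 1e-05, cross-entropy loss), using a tiny linear layer and random tensors as stand-ins for RoBERTa and the tokenized dataset so the loop is runnable on its own:</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 7)  # stand-in for RobertaForSequenceClassification (7 labels)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-05)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(32, 8)        # dummy batch; real inputs are token id tensors
labels = torch.randint(0, 7, (32,))  # dummy intent labels

losses = []
for epoch in range(3):
    optimizer.zero_grad()
    logits = model(features)         # RoBERTa would produce logits from input_ids
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

In the real notebook the forward pass goes through the fine-tuned RoBERTa model and each batch is moved to the GPU; the overall zero_grad/forward/backward/step structure is the same.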
<p>Here are some results I got using the fine-tuned model with <code class="language-plaintext highlighter-rouge">RobertaForSequenceClassification</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">get_reply</span><span class="p">(</span><span class="s">"play radiohead song"</span><span class="p">)</span>
<span class="s">'PlayMusic'</span>
<span class="n">get_reply</span><span class="p">(</span><span class="s">"it is rainy in Sao Paulo"</span><span class="p">)</span>
<span class="s">'GetWeather'</span>
<span class="n">get_reply</span><span class="p">(</span><span class="s">"Book tacos for me tonight"</span><span class="p">)</span>
<span class="s">'BookRestaurant'</span>
<span class="n">get_reply</span><span class="p">(</span><span class="s">"Book a table for me tonight"</span><span class="p">)</span>
<span class="s">'BookRestaurant'</span>
</code></pre></div></div>
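<p>The <code class="language-plaintext highlighter-rouge">get_reply</code> helper above is essentially "tokenize, forward pass, argmax, map the index back to an intent name". The last step can be sketched without the model; the <code class="language-plaintext highlighter-rouge">label_to_ix</code> contents are assumed and the logits are hard-coded for illustration:</p>

```python
# Hypothetical label_to_ix contents (assumed from the SNIPS intent set)
label_to_ix = {"AddToPlaylist": 0, "BookRestaurant": 1, "GetWeather": 2,
               "PlayMusic": 3, "RateBook": 4, "SearchCreativeWork": 5,
               "SearchScreeningEvent": 6}
ix_to_label = {ix: label for label, ix in label_to_ix.items()}

def decode_prediction(logits):
    # pick the index of the largest class logit and map it back to a name;
    # in the real helper the logits come from model(input_ids)[0]
    best = max(range(len(logits)), key=lambda i: logits[i])
    return ix_to_label[best]

decode_prediction([0.1, 0.2, 3.5, 0.0, -1.0, 0.3, 0.2])  # 'GetWeather'
```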
<p>RoBERTo hopes you have enjoyed RoBERTa 😁 and that you can use it in your projects!</p>
<h2 id="references">References</h2>
<ol>
<li><strong>RoBERTa: A Robustly Optimized BERT Pretraining Approach</strong> <a name="roberta"><a href="https://arxiv.org/abs/1907.11692">PDF</a></a><br />Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and Luke Zettlemoyer and Veselin Stoyanov, 2019</li>
<li><strong>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</strong> <a name="bert"><a href="https://arxiv.org/abs/1810.04805">PDF</a></a><br />Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova, 2018</li>
<li><strong>Natural Language Understanding benchmark</strong> <a name="snips"><a href="https://github.com/snipsco/nlu-benchmark">Link</a></a><br />Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Maël Primet, Joseph Dureau, 2018</li>
</ol>Roberto Silveirarsilveira79@gmail.comFine-tuning pytorch-transformers for SequenceClassificationMachine Comprehension with pytorch-transformers2019-08-18T21:21:41+00:002019-08-18T21:21:41+00:00https://rsilveira79.github.io/fermenting_gradients/machine_learning/nlp/pytorch/pytorch-transformer-squad<h1 id="step-by-step-guide-to-finetune-and-use-question-and-answering-models-with-pytorch-transformers">Step-by-step guide to finetune and use question and answering models with pytorch-transformers</h1>
<p>I have used question and answering systems for some time now, and I’m really impressed by how these algorithms have evolved recently. My first interaction with QA algorithms was with the BiDAF model (Bidirectional Attention Flow) <sup> <a href="#bidaf">1</a></sup> from the great <a href="https://allennlp.org/">AllenNLP</a> team. That was back in 2017, and ELMo embeddings <sup> <a href="#elmo">2</a></sup> were not even used in that BiDAF model (I believe they were using GloVe vectors in this first model). Since then, a lot has happened in the NLP arena, such as the Transformer <sup> <a href="#transformer">3</a></sup>, BERT <sup> <a href="#bert">4</a></sup> and the many other members of the Sesame Street family (there is now a whole BERT-like family, such as Facebook’s RoBERTa <sup> <a href="#roberta">5</a></sup>, ViLBERT and maybe (why not?) one day, DilBERT).</p>
<p>There are lots of great materials out there (see the <a href="#more">Probe Further</a> section for more details), so it will be much easier to go and watch those awesome video materials than to detail each model in a blog post.</p>
<p>I would rather spend time on the practical usage of question and answering models, as they can be very helpful for real-life applications (despite some challenges, such as model size, response time, and model quantization/pruning, which will be addressed in other posts).</p>
<p>In this regard, the whole ML community should give a massive shout-out to the <a href="http://huggingface.co">Hugging Face</a> team. They are really pushing the limits to make the latest and greatest algorithms available to the broader community, and it is really cool to see how rapidly their project is growing on GitHub (at the time I’m writing this, the <a href="https://github.com/huggingface/pytorch-transformers">pytorch-transformers</a> repo has already surpassed 10k ⭐️, for example). I will focus on the <a href="https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/">SQuAD 1.1</a> dataset; more details on how to fine-tune/use these models with the SQuAD 2.0 dataset will be described in further posts.</p>
<h2 id="inside-pytorch-transformers">Inside pytorch-transformers</h2>
<p>The <code class="language-plaintext highlighter-rouge">pytorch-transformers</code> lib has some special classes, and the nice thing is that they try to be consistent with this architecture independently of the model (BERT, XLNet, RoBERTa, etc). These 3 important classes are:</p>
<blockquote>
<p><strong>Config</strong> \(\rightarrow\) this is the class that defines all the configurations of the model in hand, such as number of hidden layers in Transformer, number of attention heads in the Transformer encoder, activation function, dropout rate, etc. Usually, there are 2 <em>default</em> configurations [<code class="language-plaintext highlighter-rouge">base</code>, <code class="language-plaintext highlighter-rouge">large</code>], but it is possible to tune the configurations to have different models. The file format of the configuration file is a <code class="language-plaintext highlighter-rouge">.json</code> file.</p>
</blockquote>
<blockquote>
<p><strong>Tokenizer</strong> \(\rightarrow\) the tokenizer class deals with some linguistic details of each model class, as specific tokenization types are used (such as WordPiece for BERT or SentencePiece for XLNet). It also handles begin-of-sentence (bos), end-of-sentence (eod), unknown, separation, padding, mask and any other special tokens. The tokenizer file can be loaded as a <code class="language-plaintext highlighter-rouge">.txt</code> file.</p>
</blockquote>
<blockquote>
<p><strong>Model</strong> \(\rightarrow\) finally, we need to specify the model class. In this specific case, we are going to use special classes for Question and Answering [<code class="language-plaintext highlighter-rouge">BertForQuestionAnswering</code>, <code class="language-plaintext highlighter-rouge">XLNetForQuestionAnswering</code>], but there are other classes for different downstream tasks that can be used. These downstream classes inherit [<code class="language-plaintext highlighter-rouge">BertModel</code>, <code class="language-plaintext highlighter-rouge">XLNetModel</code>] classes, which will then go into more specific details (embedding type, Transformer configuration, etc). The weights of a fine-tuned downstream task mode are stored in a <code class="language-plaintext highlighter-rouge">.bin</code> file.</p>
</blockquote>
<h2 id="download-fine-tuned-models">Download Fine-tuned models</h2>
<p><a href="https://drive.google.com/open?id=1OnvT5sKgi0WVWTXnTaaOPTE5KIh-xg_E" class="btn btn--success">BERT Model for SQuAD 1.1</a>
<a href="https://drive.google.com/open?id=1e7wu9yI-rGkSzjoPU2TpCC9FMvlKvl8R" class="btn btn--warning">XLNet Model for SQuAD 1.1</a></p>
<p class="notice--primary"><strong>Watch out!</strong> I downloaded the BERT model directly from the <a href="https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/modeling_bert.py">Hugging Face</a> repo; the XLNet model I fine-tuned myself for 3 epochs on an Nvidia 1080ti. Also, I noticed that the XLNet model may need some more training; see the <a href="#results">Results</a> section.</p>
<h2 id="finetuning-scripts">Finetuning scripts</h2>
<p>To run the fine-tuning scripts, the Hugging Face team makes available some dataset-specific files that can be found <a href="https://github.com/huggingface/pytorch-transformers/tree/master/examples">here</a>.
These fine-tuning scripts are highly customizable, for example by passing a model config specified in a <code class="language-plaintext highlighter-rouge">.json</code> file, e.g. <code class="language-plaintext highlighter-rouge">--config_name xlnet_m2.json</code>.
The examples below show BERT finetuning with the <code class="language-plaintext highlighter-rouge">base</code> configuration, and <code class="language-plaintext highlighter-rouge">xlnet</code> finetuning with specific parameters (<code class="language-plaintext highlighter-rouge">n_head</code>,<code class="language-plaintext highlighter-rouge">n_layer</code>). The models provided for download both use the <code class="language-plaintext highlighter-rouge">large</code> config.</p>
<h3 id="finetuning-bert">Finetuning BERT</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">python</span> <span class="n">run_squad</span><span class="p">.</span><span class="n">py</span> \
<span class="o">--</span><span class="n">model_type</span> <span class="n">bert</span> \
<span class="o">--</span><span class="n">model_name_or_path</span> <span class="n">bert</span><span class="o">-</span><span class="n">base</span><span class="o">-</span><span class="n">cased</span> \
<span class="o">--</span><span class="n">do_train</span> \
<span class="o">--</span><span class="n">do_eval</span> \
<span class="o">--</span><span class="n">evaluate_during_training</span> \
<span class="o">--</span><span class="n">do_lower_case</span> \
<span class="o">--</span><span class="n">train_file</span> <span class="err">$</span><span class="n">SQUAD_DIR</span><span class="o">/</span><span class="n">train</span><span class="o">-</span><span class="n">v1</span><span class="p">.</span><span class="mf">1.j</span><span class="n">son</span> \
<span class="o">--</span><span class="n">predict_file</span> <span class="err">$</span><span class="n">SQUAD_DIR</span><span class="o">/</span><span class="n">dev</span><span class="o">-</span><span class="n">v1</span><span class="p">.</span><span class="mf">1.j</span><span class="n">son</span> \
<span class="o">--</span><span class="n">save_steps</span> <span class="mi">10000</span> \
<span class="o">--</span><span class="n">learning_rate</span> <span class="mf">3e-5</span> \
<span class="o">--</span><span class="n">num_train_epochs</span> <span class="mf">5.0</span> \
<span class="o">--</span><span class="n">max_seq_length</span> <span class="mi">384</span> \
<span class="o">--</span><span class="n">doc_stride</span> <span class="mi">128</span> \
 <span class="o">--</span><span class="n">output_dir</span> <span class="o">/</span><span class="n">home</span><span class="o">/</span><span class="n">roberto</span><span class="o">/</span><span class="n">tmp</span><span class="o">/</span><span class="n">finetuned_bert</span> \
<span class="o">--</span><span class="n">overwrite_output_dir</span> \
<span class="o">--</span><span class="n">overwrite_cache</span>
</code></pre></div></div>
<h3 id="finetuning-xlnet">Finetuning XLNet</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">python</span> <span class="o">-</span><span class="n">u</span> <span class="n">run_squad</span><span class="p">.</span><span class="n">py</span> \
<span class="o">--</span><span class="n">model_type</span> <span class="n">xlnet</span> \
<span class="o">--</span><span class="n">model_name_or_path</span> <span class="n">xlnet</span><span class="o">-</span><span class="n">large</span><span class="o">-</span><span class="n">cased</span> \
<span class="o">--</span><span class="n">do_train</span> \
<span class="o">--</span><span class="n">do_eval</span> \
<span class="o">--</span><span class="n">config_name</span> <span class="n">xlnet_m2</span><span class="p">.</span><span class="n">json</span> \
<span class="o">--</span><span class="n">evaluate_during_training</span> \
<span class="o">--</span><span class="n">do_lower_case</span> \
<span class="o">--</span><span class="n">train_file</span> <span class="err">$</span><span class="n">SQUAD_DIR</span><span class="o">/</span><span class="n">train</span><span class="o">-</span><span class="n">v1</span><span class="p">.</span><span class="mf">1.j</span><span class="n">son</span> \
<span class="o">--</span><span class="n">predict_file</span> <span class="err">$</span><span class="n">SQUAD_DIR</span><span class="o">/</span><span class="n">dev</span><span class="o">-</span><span class="n">v1</span><span class="p">.</span><span class="mf">1.j</span><span class="n">son</span> \
<span class="o">--</span><span class="n">save_steps</span> <span class="mi">10000</span> \
<span class="o">--</span><span class="n">learning_rate</span> <span class="mf">3e-5</span> \
<span class="o">--</span><span class="n">num_train_epochs</span> <span class="mf">5.0</span> \
<span class="o">--</span><span class="n">max_seq_length</span> <span class="mi">384</span> \
<span class="o">--</span><span class="n">doc_stride</span> <span class="mi">128</span> \
<span class="o">--</span><span class="n">per_gpu_train_batch_size</span> <span class="mi">1</span> \
<span class="o">--</span><span class="n">output_dir</span> <span class="o">/</span><span class="n">home</span><span class="o">/</span><span class="n">roberto</span><span class="o">/</span><span class="n">tmp</span><span class="o">/</span><span class="n">finetuned_xlnet</span> \
<span class="o">--</span><span class="n">overwrite_output_dir</span> \
<span class="o">--</span><span class="n">overwrite_cache</span>
</code></pre></div></div>
<p>Config <code class="language-plaintext highlighter-rouge">xlnet_m2.json</code></p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"attn_type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"bi"</span><span class="p">,</span><span class="w">
</span><span class="nl">"bi_data"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"clamp_len"</span><span class="p">:</span><span class="w"> </span><span class="mi">-1</span><span class="p">,</span><span class="w">
</span><span class="nl">"d_head"</span><span class="p">:</span><span class="w"> </span><span class="mi">64</span><span class="p">,</span><span class="w">
</span><span class="nl">"d_inner"</span><span class="p">:</span><span class="w"> </span><span class="mi">4096</span><span class="p">,</span><span class="w">
</span><span class="nl">"d_model"</span><span class="p">:</span><span class="w"> </span><span class="mi">1024</span><span class="p">,</span><span class="w">
</span><span class="nl">"dropatt"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.1</span><span class="p">,</span><span class="w">
</span><span class="nl">"dropout"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.1</span><span class="p">,</span><span class="w">
</span><span class="nl">"end_n_top"</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w">
</span><span class="nl">"ff_activation"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gelu"</span><span class="p">,</span><span class="w">
</span><span class="nl">"finetuning_task"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
</span><span class="nl">"init"</span><span class="p">:</span><span class="w"> </span><span class="s2">"normal"</span><span class="p">,</span><span class="w">
</span><span class="nl">"init_range"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.1</span><span class="p">,</span><span class="w">
</span><span class="nl">"init_std"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.02</span><span class="p">,</span><span class="w">
</span><span class="nl">"initializer_range"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.02</span><span class="p">,</span><span class="w">
</span><span class="nl">"layer_norm_eps"</span><span class="p">:</span><span class="w"> </span><span class="mi">1e-12</span><span class="p">,</span><span class="w">
</span><span class="nl">"max_position_embeddings"</span><span class="p">:</span><span class="w"> </span><span class="mi">512</span><span class="p">,</span><span class="w">
</span><span class="nl">"mem_len"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
</span><span class="nl">"n_head"</span><span class="p">:</span><span class="w"> </span><span class="mi">16</span><span class="p">,</span><span class="w">
</span><span class="nl">"n_layer"</span><span class="p">:</span><span class="w"> </span><span class="mi">18</span><span class="p">,</span><span class="w">
</span><span class="nl">"n_token"</span><span class="p">:</span><span class="w"> </span><span class="mi">32000</span><span class="p">,</span><span class="w">
</span><span class="nl">"num_labels"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w">
</span><span class="nl">"output_attentions"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"output_hidden_states"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"reuse_len"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
</span><span class="nl">"same_length"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"start_n_top"</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w">
</span><span class="nl">"summary_activation"</span><span class="p">:</span><span class="w"> </span><span class="s2">"tanh"</span><span class="p">,</span><span class="w">
</span><span class="nl">"summary_last_dropout"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.1</span><span class="p">,</span><span class="w">
</span><span class="nl">"summary_type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"last"</span><span class="p">,</span><span class="w">
</span><span class="nl">"summary_use_proj"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"torchscript"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"untie_r"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
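<p>One detail worth noting in this config: in standard Transformer setups the attention dimensions are internally consistent, i.e. the per-head dimension times the number of heads gives back the model width (16 × 64 = 1024 here), and the feed-forward inner size is 4× the model width. A quick sanity check on a hand-copied subset of the values above:</p>

```python
# Subset of the xlnet_m2.json values above, hand-copied for the check
cfg = {"d_model": 1024, "n_head": 16, "d_head": 64, "d_inner": 4096, "n_layer": 18}

# per-head size x head count should recover the model width
assert cfg["n_head"] * cfg["d_head"] == cfg["d_model"]
# feed-forward inner size is 4x the model width in this config
assert cfg["d_inner"] == 4 * cfg["d_model"]
```

Running this check before launching a multi-hour fine-tuning job is a cheap way to catch typos in a hand-edited config.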
<h2 id="using-the-trained-models">Using the trained models</h2>
<p>Now to the fun part: using these models for question and answering!</p>
<p>First things first, let’s import the model classes from <code class="language-plaintext highlighter-rouge">pytorch-transformers</code></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">pytorch_transformers</span> <span class="kn">import</span> <span class="n">BertConfig</span><span class="p">,</span> <span class="n">BertTokenizer</span><span class="p">,</span> <span class="n">BertForQuestionAnswering</span>
<span class="kn">from</span> <span class="nn">pytorch_transformers</span> <span class="kn">import</span> <span class="n">XLNetConfig</span><span class="p">,</span> <span class="n">XLNetForQuestionAnswering</span><span class="p">,</span> <span class="n">XLNetTokenizer</span>
</code></pre></div></div>
<p>These are the 3 important classes:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">MODEL_CLASSES</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'bert'</span><span class="p">:</span> <span class="p">(</span><span class="n">BertConfig</span><span class="p">,</span> <span class="n">BertForQuestionAnswering</span><span class="p">,</span> <span class="n">BertTokenizer</span><span class="p">),</span>
<span class="s">'xlnet'</span><span class="p">:</span> <span class="p">(</span><span class="n">XLNetConfig</span><span class="p">,</span> <span class="n">XLNetForQuestionAnswering</span><span class="p">,</span> <span class="n">XLNetTokenizer</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>I’ve made this special class to handle all the feature preparation and output formatting for both BERT and XLNet, but this could be done in different ways:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">QuestionAnswering</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config_file</span><span class="p">,</span> <span class="n">weight_file</span><span class="p">,</span> <span class="n">tokenizer_file</span><span class="p">,</span> <span class="n">model_type</span> <span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">device</span><span class="p">(</span><span class="s">"cuda"</span> <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">()</span> <span class="k">else</span> <span class="s">"cpu"</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">config_class</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">model_class</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer_class</span> <span class="o">=</span> <span class="n">MODEL_CLASSES</span><span class="p">[</span><span class="n">model_type</span><span class="p">]</span>
<span class="bp">self</span><span class="p">.</span><span class="n">config</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">config_class</span><span class="p">.</span><span class="n">from_json_file</span><span class="p">(</span><span class="n">config_file</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">model_class</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">load_state_dict</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">weight_file</span><span class="p">,</span> <span class="n">map_location</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">device</span><span class="p">))</span>
<span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer_class</span><span class="p">(</span><span class="n">tokenizer_file</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">model_type</span> <span class="o">=</span> <span class="n">model_type</span>
<span class="k">def</span> <span class="nf">to_list</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tensor</span><span class="p">):</span>
<span class="k">return</span> <span class="n">tensor</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">cpu</span><span class="p">().</span><span class="n">tolist</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">get_reply</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">question</span><span class="p">,</span> <span class="n">passage</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
<span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="n">input_ids</span><span class="p">,</span> <span class="n">_</span> <span class="p">,</span> <span class="n">tokens</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">prepare_features</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">passage</span><span class="p">)</span>
<span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">model_type</span> <span class="o">==</span> <span class="s">'bert'</span><span class="p">:</span>
<span class="n">span_start</span><span class="p">,</span><span class="n">span_end</span><span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
<span class="n">answer</span> <span class="o">=</span> <span class="n">tokens</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">span_start</span><span class="p">):</span><span class="n">torch</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">span_end</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
<span class="n">answer</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">bert_convert_tokens_to_string</span><span class="p">(</span><span class="n">answer</span><span class="p">)</span>
<span class="k">elif</span> <span class="bp">self</span><span class="p">.</span><span class="n">model_type</span> <span class="o">==</span> <span class="s">'xlnet'</span><span class="p">:</span>
<span class="n">input_vector</span> <span class="o">=</span> <span class="p">{</span><span class="s">'input_ids'</span><span class="p">:</span> <span class="n">input_ids</span><span class="p">,</span>
<span class="s">'start_positions'</span><span class="p">:</span> <span class="bp">None</span><span class="p">,</span>
<span class="s">'end_positions'</span><span class="p">:</span> <span class="bp">None</span> <span class="p">}</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">input_vector</span><span class="p">)</span>
<span class="n">answer</span> <span class="o">=</span> <span class="n">tokens</span><span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">to_list</span><span class="p">(</span><span class="n">outputs</span><span class="p">[</span><span class="mi">1</span><span class="p">])[</span><span class="mi">0</span><span class="p">][</span><span class="n">torch</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">])]:</span><span class="bp">self</span><span class="p">.</span><span class="n">to_list</span><span class="p">(</span><span class="n">outputs</span><span class="p">[</span><span class="mi">3</span><span class="p">])[</span><span class="mi">0</span><span class="p">][</span><span class="n">torch</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">outputs</span><span class="p">[</span><span class="mi">2</span><span class="p">])]</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
<span class="n">answer</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">xlnet_convert_tokens_to_string</span><span class="p">(</span><span class="n">answer</span><span class="p">)</span>
<span class="k">return</span> <span class="n">answer</span>
<span class="k">def</span> <span class="nf">bert_convert_tokens_to_string</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tokens</span><span class="p">):</span>
<span class="n">out_string</span> <span class="o">=</span> <span class="s">' '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">tokens</span><span class="p">).</span><span class="n">replace</span><span class="p">(</span><span class="s">' ##'</span><span class="p">,</span> <span class="s">''</span><span class="p">).</span><span class="n">strip</span><span class="p">()</span>
<span class="k">if</span> <span class="s">'@'</span> <span class="ow">in</span> <span class="n">tokens</span><span class="p">:</span>
<span class="n">out_string</span> <span class="o">=</span> <span class="n">out_string</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' '</span><span class="p">,</span> <span class="s">''</span><span class="p">)</span>
<span class="k">return</span> <span class="n">out_string</span>
<span class="k">def</span> <span class="nf">xlnet_convert_tokens_to_string</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tokens</span><span class="p">):</span>
<span class="n">out_string</span> <span class="o">=</span> <span class="s">''</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">tokens</span><span class="p">).</span><span class="n">replace</span><span class="p">(</span><span class="s">'▁'</span><span class="p">,</span> <span class="s">' '</span><span class="p">).</span><span class="n">strip</span><span class="p">()</span>
<span class="k">return</span> <span class="n">out_string</span>
<span class="k">def</span> <span class="nf">prepare_features</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">question</span><span class="p">,</span> <span class="n">passage</span><span class="p">,</span> <span class="n">max_seq_length</span> <span class="o">=</span> <span class="mi">300</span><span class="p">,</span>
<span class="n">zero_pad</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span> <span class="n">include_CLS_token</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span> <span class="n">include_SEP_token</span> <span class="o">=</span> <span class="bp">True</span><span class="p">):</span>
<span class="c1">## Tokenize Input
</span> <span class="n">tokens_a</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">tokenize</span><span class="p">(</span><span class="n">question</span><span class="p">)</span>
<span class="n">tokens_b</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">tokenize</span><span class="p">(</span><span class="n">passage</span><span class="p">)</span>
<span class="c1">## Truncate
</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokens_a</span><span class="p">)</span> <span class="o">></span> <span class="n">max_seq_length</span> <span class="o">-</span> <span class="mi">2</span><span class="p">:</span>
<span class="n">tokens_a</span> <span class="o">=</span> <span class="n">tokens_a</span><span class="p">[</span><span class="mi">0</span><span class="p">:(</span><span class="n">max_seq_length</span> <span class="o">-</span> <span class="mi">2</span><span class="p">)]</span>
<span class="c1">## Initialize Tokens
</span> <span class="n">tokens</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">if</span> <span class="n">include_CLS_token</span><span class="p">:</span>
<span class="n">tokens</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">cls_token</span><span class="p">)</span>
<span class="c1">## Add Tokens and separators
</span> <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">tokens_a</span><span class="p">:</span>
<span class="n">tokens</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">token</span><span class="p">)</span>
<span class="k">if</span> <span class="n">include_SEP_token</span><span class="p">:</span>
<span class="n">tokens</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">sep_token</span><span class="p">)</span>
<span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">tokens_b</span><span class="p">:</span>
<span class="n">tokens</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">token</span><span class="p">)</span>
<span class="c1">## Convert Tokens to IDs
</span> <span class="n">input_ids</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">convert_tokens_to_ids</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
<span class="c1">## Input Mask
</span> <span class="n">input_mask</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
<span class="c1">## Zero-pad sequence length
</span> <span class="k">if</span> <span class="n">zero_pad</span><span class="p">:</span>
<span class="k">while</span> <span class="nb">len</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span> <span class="o"><</span> <span class="n">max_seq_length</span><span class="p">:</span>
<span class="n">input_ids</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">input_mask</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">input_ids</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">input_mask</span><span class="p">,</span> <span class="n">tokens</span>
</code></pre></div></div>
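<p>To make the <code class="language-plaintext highlighter-rouge">prepare_features</code> pipeline easier to follow, here is a minimal plain-Python sketch of the same steps (tokenize, truncate, add special tokens, build the attention mask, zero-pad). The whitespace split is a hypothetical stand-in for the real WordPiece/SentencePiece tokenizers, and the id assignment is a dummy, so this runs without any model files:</p>

```python
# Sketch of the prepare_features logic. The whitespace "tokenizer" and the
# dummy id assignment are stand-ins; the real class uses the pretrained
# tokenizer's tokenize() and convert_tokens_to_ids().
def prepare_features_sketch(question, passage, max_seq_length=300, zero_pad=False):
    tokens_a = question.split()           # stand-in for tokenizer.tokenize(question)
    tokens_b = passage.split()            # stand-in for tokenizer.tokenize(passage)
    # Truncate so [CLS] + question + [SEP] fits in max_seq_length
    tokens_a = tokens_a[: max_seq_length - 2]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b
    input_ids = list(range(len(tokens)))  # stand-in for convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)     # 1 = real token, 0 = padding
    if zero_pad:
        while len(input_ids) < max_seq_length:
            input_ids.append(0)
            input_mask.append(0)
    return input_ids, input_mask, tokens

ids, mask, toks = prepare_features_sketch(
    "What is my age?", "I am 40 years old.",
    max_seq_length=16, zero_pad=True)
```

<p>The attention mask is what lets the model ignore the zero padding: it has a 1 for every real token and a 0 for every padded position.</p>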
<p>Finally, we just need to instantiate these models and start using them!</p>
<p>BERT:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bert</span> <span class="o">=</span> <span class="n">QuestionAnswering</span><span class="p">(</span>
<span class="n">config_file</span> <span class="o">=</span> <span class="s">'bert-large-cased-whole-word-masking-finetuned-squad-config.json'</span><span class="p">,</span>
<span class="n">weight_file</span><span class="o">=</span> <span class="s">'bert-large-cased-whole-word-masking-finetuned-squad-pytorch_model.bin'</span><span class="p">,</span>
<span class="n">tokenizer_file</span><span class="o">=</span> <span class="s">'bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt'</span><span class="p">,</span>
<span class="n">model_type</span> <span class="o">=</span> <span class="s">'bert'</span>
<span class="p">)</span>
</code></pre></div></div>
<p>XLNet:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">xlnet</span> <span class="o">=</span> <span class="n">QuestionAnswering</span><span class="p">(</span>
<span class="n">config_file</span> <span class="o">=</span> <span class="s">'xlnet-cased-finetuned-squad.json'</span><span class="p">,</span>
<span class="n">weight_file</span><span class="o">=</span> <span class="s">'xlnet-cased-finetuned-squad.bin'</span><span class="p">,</span>
<span class="n">tokenizer_file</span><span class="o">=</span> <span class="s">'xlnet-large-cased-spiece.txt'</span><span class="p">,</span>
<span class="n">model_type</span> <span class="o">=</span> <span class="s">'xlnet'</span>
<span class="p">)</span>
</code></pre></div></div>
<h2 id="results">Results<a name="results"></a></h2>
<p>I’ve included some sample <code class="language-plaintext highlighter-rouge">facts</code> and <code class="language-plaintext highlighter-rouge">questions</code> to give these algorithms a go:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">facts</span> <span class="o">=</span> <span class="s">" My wife is great. </span><span class="se">\
</span><span class="s">My complete name is Roberto Pereira Silveira. </span><span class="se">\
</span><span class="s">I am 40 years old. </span><span class="se">\
</span><span class="s">My dog is cool. </span><span class="se">\
</span><span class="s">My dog breed is jack russel. </span><span class="se">\
</span><span class="s">My dog was born in 2014.</span><span class="se">\
</span><span class="s">My dog name is Mallu. </span><span class="se">\
</span><span class="s">My dog is 5 years old. </span><span class="se">\
</span><span class="s">I am an engineer. </span><span class="se">\
</span><span class="s">I was born in 1979. </span><span class="se">\
</span><span class="s">My e-mail is rsilveira79@gmail.com."</span>
<span class="n">questions</span> <span class="o">=</span> <span class="p">[</span>
<span class="s">"What is my complete name?"</span><span class="p">,</span>
<span class="s">"What is dog name?"</span><span class="p">,</span>
<span class="s">"What is my dog age?"</span><span class="p">,</span>
<span class="s">"What is my age?"</span><span class="p">,</span>
<span class="s">"What is my dog breed?"</span><span class="p">,</span>
<span class="s">"When I was born?"</span><span class="p">,</span>
<span class="s">"What is my e-mail?"</span>
<span class="p">]</span>
</code></pre></div></div>
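<p>The printout in the next block was produced by looping over the questions and querying both models. A sketch of that loop is below; the answering method's name is not shown in this excerpt, so <code class="language-plaintext highlighter-rouge">get_answer</code> is a hypothetical stand-in, stubbed out here so the sketch runs without model weights:</p>

```python
# Hypothetical sketch of the results loop. get_answer(model, question, facts)
# stands in for the class method that extracts the answer span; the stub
# below just echoes the question so the loop runs without a GPU or weights.
def get_answer(model_name, question, passage):
    return f"<{model_name} answer to: {question}>"  # stub, not a real prediction

facts = "My complete name is Roberto Pereira Silveira. I am 40 years old."
questions = ["What is my complete name?", "What is my age?"]

lines = []
for q in questions:
    lines.append(f"QUESTION: {q}")
    lines.append(f"BERT: {get_answer('bert', q, facts)}")
    lines.append(f"XLNET: {get_answer('xlnet', q, facts)}")
    lines.append("-" * 50)
print("\n".join(lines))
```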
<p>And here are the results! As you can see, XLNet could use a bit more fine-tuning, but it is already returning good results:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">QUESTION</span><span class="p">:</span> <span class="n">What</span> <span class="ow">is</span> <span class="n">my</span> <span class="n">complete</span> <span class="n">name</span><span class="err">?</span>
<span class="n">BERT</span><span class="p">:</span> <span class="n">roberto</span> <span class="n">pereira</span> <span class="n">silveira</span>
<span class="n">XLNET</span><span class="p">:</span> <span class="n">Roberto</span> <span class="n">Pereira</span> <span class="n">Silveira</span>
<span class="o">--------------------------------------------------</span>
<span class="n">QUESTION</span><span class="p">:</span> <span class="n">What</span> <span class="ow">is</span> <span class="n">dog</span> <span class="n">name</span><span class="err">?</span>
<span class="n">BERT</span><span class="p">:</span> <span class="n">mallu</span>
<span class="n">XLNET</span><span class="p">:</span> <span class="n">Roberto</span> <span class="n">Pereira</span> <span class="n">Silveira</span><span class="p">.</span> <span class="n">I</span> <span class="n">am</span> <span class="mi">40</span> <span class="n">years</span> <span class="n">old</span><span class="p">.</span> <span class="n">My</span> <span class="n">dog</span> <span class="ow">is</span> <span class="n">cool</span><span class="p">.</span> <span class="n">My</span> <span class="n">dog</span> <span class="n">breed</span> <span class="ow">is</span> <span class="n">jack</span> <span class="n">russel</span><span class="p">.</span> <span class="n">My</span> <span class="n">dog</span> <span class="n">was</span> <span class="n">born</span> <span class="ow">in</span> <span class="mf">2014.</span><span class="n">My</span> <span class="n">dog</span> <span class="n">name</span> <span class="ow">is</span> <span class="n">Mallu</span>
<span class="o">--------------------------------------------------</span>
<span class="n">QUESTION</span><span class="p">:</span> <span class="n">What</span> <span class="ow">is</span> <span class="n">my</span> <span class="n">dog</span> <span class="n">age</span><span class="err">?</span>
<span class="n">BERT</span><span class="p">:</span> <span class="mi">5</span> <span class="n">years</span> <span class="n">old</span>
<span class="n">XLNET</span><span class="p">:</span> <span class="mi">40</span> <span class="n">years</span> <span class="n">old</span>
<span class="o">--------------------------------------------------</span>
<span class="n">QUESTION</span><span class="p">:</span> <span class="n">What</span> <span class="ow">is</span> <span class="n">my</span> <span class="n">age</span><span class="err">?</span>
<span class="n">BERT</span><span class="p">:</span> <span class="mi">40</span>
<span class="n">XLNET</span><span class="p">:</span> <span class="mi">40</span> <span class="n">years</span> <span class="n">old</span>
<span class="o">--------------------------------------------------</span>
<span class="n">QUESTION</span><span class="p">:</span> <span class="n">What</span> <span class="ow">is</span> <span class="n">my</span> <span class="n">dog</span> <span class="n">breed</span><span class="err">?</span>
<span class="n">BERT</span><span class="p">:</span> <span class="n">jack</span> <span class="n">russel</span>
<span class="n">XLNET</span><span class="p">:</span> <span class="n">jack</span> <span class="n">russel</span>
<span class="o">--------------------------------------------------</span>
<span class="n">QUESTION</span><span class="p">:</span> <span class="n">When</span> <span class="n">I</span> <span class="n">was</span> <span class="n">born</span><span class="err">?</span>
<span class="n">BERT</span><span class="p">:</span> <span class="mi">1979</span>
<span class="n">XLNET</span><span class="p">:</span> <span class="mi">1979</span>
<span class="o">--------------------------------------------------</span>
<span class="n">QUESTION</span><span class="p">:</span> <span class="n">What</span> <span class="ow">is</span> <span class="n">my</span> <span class="n">e</span><span class="o">-</span><span class="n">mail</span><span class="err">?</span>
<span class="n">BERT</span><span class="p">:</span> <span class="n">rsilveira79</span><span class="o">@</span><span class="n">gmail</span><span class="p">.</span><span class="n">com</span>
<span class="n">XLNET</span><span class="p">:</span> <span class="n">rsilveira79</span><span class="o">@</span><span class="n">gmail</span><span class="p">.</span><span class="n">com</span>
<span class="o">--------------------------------------------------</span>
</code></pre></div></div>
<p>Hope you enjoyed and till the next post!</p>
<h2 id="references">References</h2>
<ol>
<li><strong>Bidirectional Attention Flow for Machine Comprehension</strong> <a name="bidaf"><a href="https://arxiv.org/abs/1611.01603">PDF</a></a><br />Minjoon Seo and Aniruddha Kembhavi and Ali Farhadi and Hannaneh Hajishirzi, 2016</li>
<li><strong>Deep contextualized word representations</strong> <a name="elmo"><a href="https://arxiv.org/abs/1802.05365">PDF</a></a><br />Peters, Matthew and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke, 2018</li>
<li><strong>Attention Is All You Need</strong> <a name="transformer"><a href="https://arxiv.org/abs/1706.03762">PDF</a></a><br />Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin, 2017</li>
<li><strong>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</strong> <a name="bert"><a href="https://arxiv.org/abs/1810.04805">PDF</a></a><br />Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova, 2018</li>
<li><strong>RoBERTa: A Robustly Optimized BERT Pretraining Approach</strong> <a name="roberta"><a href="https://arxiv.org/abs/1907.11692">PDF</a></a><br />Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and Luke Zettlemoyer and Veselin Stoyanov, 2019</li>
</ol>
<h2 id="to-probe-further">To Probe Further<a name="more"></a></h2>
<ol>
<li><strong>The Illustrated Transformer</strong> <a name="illustrated_transformer"><a href="http://jalammar.github.io/illustrated-transformer/">Link</a></a><br />Jay Alammar</li>
<li><strong>Stanford CS224n NLP Class w/Ashish Vaswani &amp; Anna Huang</strong> <a name="cs224n_transformer"><a href="https://youtu.be/5vcj8kSwBCY">Link</a></a><br />Professor Christopher Manning</li>
<li><strong>ELMo - Paper Explained</strong> <a name="elmo_explained"><a href="https://www.youtube.com/watch?v=9JfGxKkmBc0">Link</a></a><br />ML Papers Explained - A.I. Socratic Circles - AISC</li>
<li><strong>Transformer - Paper Explained</strong> <a name="transformer_explained"><a href="https://www.youtube.com/watch?v=S0KakHcj_rs">Link</a></a><br />ML Papers Explained - A.I. Socratic Circles - AISC</li>
<li><strong>BERT - Paper Explained</strong> <a name="bert_explained"><a href="https://youtu.be/BhlOGGzC0Q00">Link</a></a><br />ML Papers Explained - A.I. Socratic Circles - AISC</li>
<li><strong>XLNet - Paper Explained</strong> <a name="xlnet_explained"><a href="https://www.youtube.com/watch?v=Mgck4XFR9GA">Link</a></a><br />ML Papers Explained - A.I. Socratic Circles - AISC</li>
</ol>