Readings: Deep Learning and Computational Linguistics


Here's a small peek at the ideas that excite me right now.

Papers have short half-lives. I'm quite new to this area of research, so I focus more on where ideas are going than on where they are.

The first part covers predominantly deep learning approaches to language understanding. The second part includes more linguistically informed methods.

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Better text encoders come out of attempting GANs for text.
See also: Paper Review by Mark Neumann

ELECTRA is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN.
[...] Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not.
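
To make the training signal concrete, here is a minimal sketch of how replaced-token-detection targets could be built. It is not the paper's code: the `corrupt` function, the toy vocabulary, and the uniform random sampler standing in for the generator network are all assumptions made for illustration.

```python
import random


def corrupt(tokens, vocab, replace_prob=0.15, rng=None):
    """Replace a random subset of tokens and return (corrupted, labels).

    labels[i] is 1 if position i now holds a different token than the
    original and 0 otherwise. These are the discriminator's targets.
    """
    rng = rng or random.Random(0)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Assumption: a uniform sampler stands in for the generator.
            sample = rng.choice(vocab)
            corrupted.append(sample)
            # If the sample happens to equal the original token, the
            # position counts as original, not replaced.
            labels.append(int(sample != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels


tokens = "the chef cooked the meal".split()
vocab = ["the", "chef", "cooked", "ate", "meal", "a"]
corrupted, labels = corrupt(tokens, vocab, replace_prob=0.5)
print(list(zip(corrupted, labels)))
```

Every position gets a label, so the discriminator receives a training signal from all input tokens rather than only from the replaced ones.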

What stood out to me:
ELECTRA is about 30 times* faster to train than BERT. This means that a startup, after Series A funding, can afford the compute to train a language model for its domain from scratch ... provided it owns a huge corpus that public pre-trained models haven't seen, which is unlikely.

Hopfield Networks is All You Need

700 GitHub ★ in just two weeks

A possible replacement for the self-attention layer of transformer neural network architectures. This is the best computer science research that ever came out of Johannes Kepler University (JKU) in Linz, Austria.

My brother is doing a CS undergrad at JKU and I'm always disappointed by how pedagogically bankrupt his course instructions and homework assignments are.

So, I was surprised when I saw JKU on this paper. It seems their (new) AI department is capable.

The transformer and BERT models pushed the performance on NLP tasks to new levels via their attention mechanism. We show that this attention mechanism is the update rule of a modern Hopfield network with continuous states. This new Hopfield network can store exponentially (with the dimension) many patterns, converges with one update, and has exponentially small retrieval errors.
The Hopfield layer can be used as plug-in replacement for existing layers as well as for applications like multiple instance learning, set-based and permutation invariant learning, associative learning, and many more
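
The update rule from the abstract can be written as new_query = sum_i softmax(beta * <x_i, query>)_i * x_i, where the x_i are the stored patterns. Below is a minimal pure-Python sketch of one such step. The function names, the toy patterns, and the choice beta=8.0 are assumptions for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_update(patterns, query, beta=1.0):
    """One retrieval step: a softmax-weighted average of stored patterns."""
    scores = [beta * sum(p * q for p, q in zip(pattern, query))
              for pattern in patterns]
    weights = softmax(scores)
    return [sum(w * pattern[d] for w, pattern in zip(weights, patterns))
            for d in range(len(query))]


# Three stored patterns and a noisy version of the first one.
patterns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy = [0.9, 0.2, -0.1]
retrieved = hopfield_update(patterns, noisy, beta=8.0)
print(retrieved)
```

With a large enough beta, a single step lands very close to the first stored pattern. Read the query as an attention query and the patterns as both keys and values, and the same computation is softmax attention.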

Generalizing Natural Language Analysis through Span-relation Representations

Idea: Humans can analyze language in a single format, so machines might as well.

Approach: A large number of NLP subtasks, like NER, relation extraction, Semantic Role Labeling, sentiment analysis, ..., can be represented in a single format: spans, and relations between spans.
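
A rough sketch of what such a shared format could look like as a data structure. The `Span` and `Relation` classes, the label names, and the example sentence are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # index of the first token (inclusive)
    end: int    # index after the last token (exclusive)
    label: str


@dataclass(frozen=True)
class Relation:
    head: Span
    tail: Span
    label: str


def text(span, tokens):
    """Return the surface string a span covers."""
    return " ".join(tokens[span.start:span.end])


tokens = ["Marie", "Curie", "discovered", "polonium", "."]

# NER: labeled spans only.
person = Span(0, 2, "PERSON")
element = Span(3, 4, "CHEMICAL")

# Relation extraction: a labeled relation between two entity spans.
discovery = Relation(person, element, "discovered")

# Semantic role labeling: a predicate span linked to its argument spans.
predicate = Span(2, 3, "PREDICATE")
roles = [
    Relation(predicate, Span(0, 2, "ARGUMENT"), "ARG0"),
    Relation(predicate, Span(3, 4, "ARGUMENT"), "ARG1"),
]

print(text(discovery.head, tokens), "->", text(discovery.tail, tokens))
```

Three different tasks end up as instances of the same two types, which is what would let one model architecture serve all of them.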

Interesting direction, but it still seems impractical for actually building a debuggable pipeline if it's all just one module.

Understanding the Polarity of Events in the Biomedical Literature: Deep Learning vs. Linguistically-informed Methods

An important task in the machine reading of biochemical events expressed in biomedical texts is correctly reading the polarity, i.e., attributing whether the biochemical event is a promotion or an inhibition.
The best performing deep learning architecture achieves 0.968 average F1 performance in a five-fold cross-validation study, a considerable improvement over the linguistically informed model average F1 of 0.862.

Going from 0.86 to 0.97 in F1 on polarity classification is dramatic, but the deep learning model is not practical for a task like annotating 10 million PubMed abstracts.

The Snake Oil of Syntax in Applied NLP

I found this post after two days of developing dependency matching rules for the high-level relation extraction tool I'm building. It explains that syntactic parsers are nowhere near as precise as reported, which is one reason why syntax-based rule matching can be unreliable.

Apparently, current dependency parsers effectively reach a LAS (Labeled Attachment Score) of only 55%.
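
For reference, LAS is the fraction of tokens whose predicted head and dependency label both match the gold annotation; UAS only checks the head. A small sketch of the computation follows. The example parse and the function name are made up for illustration.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two parses of the same sentence.

    Each parse is a list with one (head_index, label) pair per token.
    """
    if len(gold) != len(predicted):
        raise ValueError("parses must cover the same tokens")
    n = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return uas_hits / n, las_hits / n


# "The Sheriff shot Bob": heads are 1-indexed token positions, 0 is the root.
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
# Hypothetical parser output that mislabels the subject as an object.
predicted = [(2, "det"), (3, "obj"), (0, "root"), (3, "obj")]

uas, las = attachment_scores(gold, predicted)
print(uas, las)  # 1.0 0.75
```

A single wrong label leaves the tree shape intact but breaks any rule that matches on `nsubj`, which is why label errors hurt pattern matching so much.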

Mark then goes on to reframe relation extraction as a span prediction task and shows how this tiny conceptual change makes it easier and faster.

Extraction of causal structure from procedural text for discourse representations

Main idea: Discourse is grounded in physics and syntactic forms are implicitly causal

Not only does it [force dynamics] apply to expressions in the physical domain like leaning on or dragging, but it also plays an important role in expressions involving psychological forces (e.g. wanting or being urged). Furthermore, the concept of force dynamics can be extended to discourse. For example, the situation in which speakers A and B argue, after which speaker A gives in to speaker B, exhibits a force dynamic pattern

Linguistically, force dynamics are a good grounding framework. The paper didn't come with code or test results, so it is unclear whether any of this theory actually works in practice.

The extraction mechanism consists minimally of these steps: identify participants of each event (e.g., predicate-argument structure), classify causal and non-causal relations between participants, classify entity qualitative state changes (or no change), and infer entity coreference links (incl. set/member and part/whole relations). The semantic classification tasks depend largely on a survey of English language data, cross-linguistic analyses, and recent experiments using transfer learning that provide evidence of the highly predictive mapping between surface syntax and causal, force-dynamic meaning
Linguist: "Let's generate these process graphs automatically." Computational linguist: "Good luck."

Attention Guided Graph Convolutional Networks for Relation Extraction

As is often the case in machine learning, the idea is a special case of turning a discrete structure, like a grammatical dependency tree, into a differentiable structure, like a weighted graph or tensor, that can be handed to a model directly (see: Dependency Forest).

[... the model] directly takes full dependency trees as inputs. Our model can be understood as a soft-pruning approach that automatically learns how to selectively attend to the relevant sub-structures useful for the relation extraction task.

People usually prune dependency trees manually, but it might be better to let the model find the best pruning of a tree.

[...] rule-based pruning strategies might eliminate some important information in the full tree. Intuitively, we develop a “soft pruning” strategy that transforms the original dependency tree into a fully connected edge-weighted graph.
These weights can be viewed as the strength of relatedness between nodes, which can be learned in an end-to-end fashion by using self-attention mechanism
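
A toy contrast between the hard tree and its soft, fully connected counterpart. This is only a sketch of the idea: the paper's model learns its attention weights end to end, whereas the node vectors below are fixed, made-up numbers and there are no learned projections.

```python
import math


def tree_adjacency(n, edges):
    """Hard 0/1 adjacency matrix of a dependency tree (undirected)."""
    adj = [[0.0] * n for _ in range(n)]
    for head, dep in edges:
        adj[head][dep] = adj[dep][head] = 1.0
    return adj


def soft_adjacency(vectors):
    """Row i is a softmax over scaled dot products of node i with all nodes."""
    dim = len(vectors[0])
    rows = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in vectors]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


# Tokens: 0 = "Sheriff", 1 = "shot", 2 = "Bob"; "shot" heads both nouns.
hard = tree_adjacency(3, [(1, 0), (1, 2)])
# Assumption: hand-picked 2-d node vectors instead of learned ones.
soft = soft_adjacency([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])

print(hard[0][2], soft[0][2])
```

In the hard matrix "Sheriff" and "Bob" are not connected at all, while the soft matrix gives every pair a nonzero weight, so nothing is discarded before the model has seen the task.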

Enriched Dependencies

Problem: Dependency trees are about syntax and often miss direct, higher-level relations between two words.

We introduce a broad-coverage, data-driven and linguistically sound set of transformations, that makes event-structure and many lexical relations explicit.


PyBart: enriched paths in green

It doesn't look like much, but it is when you're writing pattern matching rules like shot|dobj|personA shot|subj|personB to find the sentences in a corpus where someone got shot. Without enrichment you'd miss the sentence in the figure above.

With enriched dependencies, the shortest dependency path (SDP) between two entities (Sheriff and Bob) can become shorter and simpler. That means linguistic patterns overall are easier to create and have higher recall.
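
A small sketch of the SDP effect, using breadth-first search over an undirected dependency graph. The sentence "The Sheriff managed to shoot Bob" and the added subject edge are hypothetical. They are not the sentence from the figure and not actual PyBart output.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Breadth-first search over (head, label, dependent) edges.

    Edge direction is ignored. Returns the token path or None.
    """
    graph = {}
    for head, _label, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Plain syntactic edges: "Sheriff" is only attached to "managed".
basic = [
    ("managed", "nsubj", "Sheriff"),
    ("managed", "xcomp", "shoot"),
    ("shoot", "obj", "Bob"),
]
# Hypothetical enrichment: make "Sheriff" an explicit subject of "shoot".
enriched = basic + [("shoot", "nsubj", "Sheriff")]

basic_path = shortest_path(basic, "Sheriff", "Bob")
enriched_path = shortest_path(enriched, "Sheriff", "Bob")
print(basic_path)     # ['Sheriff', 'managed', 'shoot', 'Bob']
print(enriched_path)  # ['Sheriff', 'shoot', 'Bob']
```

With the extra edge, a two-edge subject/object pattern around "shoot" matches directly, and no separate rule for the "managed to" construction is needed.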

A Local Grammar of Cause and Effect: A Corpus-Driven Study [PhD Thesis]

From 2004, but Still Relevant
This thesis has a lot of gems about how grammars and reasoning differ by scientific domain. It includes lots of linguistic corpus analysis, including Molecular Biology.

However the field of molecular biology is fundamentally different from that of clinical narrative reports which in turn necessitates its own sublanguage grammar.
This representation makes use of essentially similar entities but needs to capture the molecular pathway relationships which are particular to the biomolecular domain.

Categories and themes for the biomolecular domain

The Golden Tablet of Themes
More Themes
Immediately relevant to the present project on causation are the actions activate, act upon, cause, generate, modify and promote all of which can be subsumed under the heading of cause in the sublanguage analysis.

And... Causation has no one Meaning (Polysemy)

Within each scientific sub-domain, scientists seek to discover the mechanisms of causal relations specific to the phenomena under observation. In fundamental particle physics, the production of electron anti-neutrinos is related causally to the decay of electron neutrinos. By way of contrast causation in genetics is frequently expressed in terms of disruption or disturbance in chemical base pairs making up DNA.