Joint embedding of words and relationships in semantic space.
Suppose we have a set of things: words in a language, concepts in an ontology, or any other collection of objects. To reason computationally about these entities, or to use them in modelling tasks, we need a representation: a way of encoding them that is meaningful to a computer. A naïve approach might be to number all the entities, so 'aardvark' is 1 and 'zoetrope' is 140,181, but this is not very useful and may be misleading. In this example entities are ordered alphabetically, but do we really want 'dog' to be closer to 'donkey' than to 'puppy'? Another approach is to make all entities equidistant from each other, which might make sense if we know nothing about them: should 'gabhar' be closer to 'gadhar' than to 'caora'? However, this ignores the (possibly unknown) structure that exists between these entities. In a good representation, similar entities should have similar representations.
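To make this concrete, here is a minimal sketch contrasting the three options: arbitrary IDs, equidistant one-hot vectors, and dense vectors. The toy vocabulary and every number below are invented for illustration only.

```python
import numpy as np

# Three ways to encode a toy vocabulary; all values below are invented.
vocab = ['aardvark', 'dog', 'donkey', 'puppy', 'zoetrope']

# 1. Arbitrary numbering: alphabetical IDs make 'dog' (1) look closer to
#    'donkey' (2) than to 'puppy' (3), which is not what we want.
ids = {word: i for i, word in enumerate(vocab)}

# 2. One-hot vectors: every pair of distinct words is equally far apart,
#    so any structure between the entities is ignored.
one_hot = {word: np.eye(len(vocab))[i] for word, i in ids.items()}

# 3. Dense vectors (hand-picked here, learned in practice): similar entities
#    get similar representations.
dense = {
    'dog':    np.array([0.9, 0.1]),
    'puppy':  np.array([0.8, 0.2]),
    'donkey': np.array([0.2, 0.9]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(dense['dog'], dense['puppy']))   # high: similar
print(cosine(dense['dog'], dense['donkey']))  # lower: less similar
```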
The question is now: how do we know which entities are similar (without asking humans)? For language, a common approach is distributional semantics. The idea (from Firth) is that a word 'is characterised by the company it keeps': look at the sentences a word appears in, and you'll learn something about its meaning. If two words are completely interchangeable in a sentence, they're probably synonyms. So, choose a representation such that words appearing in similar contexts (within a sentence) have similar representations. If this seems circular, it's because it is, a little: before we can say how similar a context is, we need the similarity of the words in that context. The way we overcome this is to start with a guess: make up a representation for each word - you can think of this as some location on a map, although in practice it's much higher dimensional - and update them as you see more sentences. Eventually, if all goes well, similar words end up clustered together.
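A cartoon of the count-based flavour of this idea, using a made-up three-sentence corpus: represent each word by the counts of its in-sentence company, and words that keep the same company get similar vectors. (Real systems learn vectors iteratively with a proper objective rather than counting directly, but the spirit is the same.)

```python
import numpy as np

# Toy corpus; in practice you'd want millions of sentences.
corpus = [
    "the dog chased the cat".split(),
    "the puppy chased the cat".split(),
    "the horse ate the grass".split(),
]

vocab = sorted({w for sentence in corpus for w in sentence})
index = {w: i for i, w in enumerate(vocab)}

# Represent each word by the counts of the words it co-occurs with
# in a sentence: its 'company'.
counts = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j, context_word in enumerate(sentence):
            if i != j:
                counts[index[word], index[context_word]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

# 'dog' and 'puppy' appear in identical contexts, so their vectors agree.
print(cosine(counts[index['dog']], counts[index['puppy']]))  # ~1.0
print(cosine(counts[index['dog']], counts[index['horse']]))  # lower
```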
This is great! Now when we look at the representation of 'dog', it's close to 'puppy' and 'doggy' and 'doge', fairly close to 'cat' and 'wolf', not so close to 'horse', and very far from 'sadness'. We can pass these representations into an algorithm to do something else - say, to identify which tweets are about dogs (a noble and important task) - and it will 'know' that a person mentioning puppies might be talking about dogs, even if they didn't use that word. That's one of the benefits of having a good representation.
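A sketch of that downstream use, with invented embedding values and an arbitrary threshold: given a decent representation, a tweet about puppies scores as dog-related even though it never says 'dog'.

```python
import numpy as np

# Hypothetical learned embeddings; the values are invented for illustration.
embeddings = {
    'dog':     np.array([0.90, 0.10, 0.05]),
    'puppy':   np.array([0.85, 0.20, 0.05]),
    'cat':     np.array([0.60, 0.50, 0.05]),
    'horse':   np.array([0.30, 0.20, 0.70]),
    'sadness': np.array([-0.50, 0.10, 0.10]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def mentions_dogs(tweet, threshold=0.9):
    """Crude dog-detector: does any known word in the tweet sit close to
    'dog' in the representation space? The threshold is arbitrary."""
    words = [w for w in tweet.lower().split() if w in embeddings]
    return any(cosine(embeddings[w], embeddings['dog']) >= threshold
               for w in words)

print(mentions_dogs("look at this puppy"))           # True, without 'dog'
print(mentions_dogs("feeling nothing but sadness"))  # False
```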
The work, however, is not yet complete. Consider the following:
Without further qualification, humans are capable of making vague similarity judgements: you probably agreed that dogs are more like puppies than they are like donkeys. But now suppose we're shepherds looking for an animal to guard the flock. In this specific context (that of sheep-guarding), dogs are more like donkeys than puppies (donkeys can be employed to guard livestock; puppies should not be). At the same time, puppies and kittens are both young animals, so while searching for cute gifs, either might do. We want a representation which knows about these different types of similarity.
Examples of English text are abundant on the internet. Google is especially good at describing the statistics of English, which means we can get a very accurate idea of the typical context for any given word. In other languages and domains, this may not be the case. My focus is on medicine, where the following things occur:
1. Access to data is limited.
2. The data we do have may not contain all the information we need.
So we have limited access to data which may not contain all the information we need. The danger of point 2 is that our idea of a word's context becomes quite coarse. We might know that tamoxifen and imatinib are both drugs, because we see many sentences like 'started on tamoxifen/imatinib' (note that this is a sentence fragment) and have figured out that one 'starts on' drugs, but the fact that these drugs treat entirely different cancers may be lost. This is a problem.
In this project we came up with a model for learning representations which capture different notions of similarity, and which can still work even if data is limited.
We did this by extending Firth's notion of 'company': a word can keep many kinds of company, its neighbours in a sentence as well as any other words with which it has some relationship. Dogs can guard sheep, and so can donkeys, so in the context of 'guarding' they may as well be neighbours. We can then augment our dataset with statements of the form 'X is related to Y through Z' (where they exist!) to provide additional information to both sources.
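One way to picture the combined dataset is to pour both kinds of 'company' into a single pool of 'X is related to Y through Z' triples; the relation names and examples below are invented for illustration, not the actual data sources.

```python
# Sketch: merging free text and structured statements into one pool of
# (X, relationship, Y) triples. Names and examples are invented.

corpus = [
    "the dog guards the sheep".split(),
    "the donkey guards the sheep".split(),
]

# Structured statements, e.g. from a curated database.
knowledge = [
    ("dog", "can_guard", "sheep"),
    ("donkey", "can_guard", "sheep"),
    ("imatinib", "treats", "leukaemia"),
]

triples = list(knowledge)

# Sentence neighbours become triples too, under a generic
# 'appears_in_a_sentence_with' relationship.
for sentence in corpus:
    for i, left in enumerate(sentence):
        for right in sentence[i + 1:]:
            triples.append((left, "appears_in_a_sentence_with", right))

print(len(triples), triples[:4])
```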
Luckily, technical domains such as medicine tend towards categorising their knowledge in structured databases, which provides us with a source of 'X is related to Y through Z' statements! (This is not a coincidence.) And as a side-effect of training our representation on such a dataset, we can actually try to extend it. Since we learn a general representation for words regardless of their source (sentences v. relationship statements), we can try to apply what we learned about the relationships to words appearing only in the sentences. That is, if 'gleevec' never appeared in a 'X is related to Y'-type statement, but we learned that gleevec is a synonym for imatinib, then we can say with some confidence that 'gleevec is related to leukaemia through "treats"'.
We are motivated by the idea that relationships between entities in a semantic space can be modelled by geometric transformations of that space. This imposes structure on the solution space, which should produce more semantically meaningful representations, as well as enabling knowledge discovery by exploiting the learned transformations.
Specifically, we learn a joint generative model over (S, R, T) triples, where S and T are tokens (e.g. words in a vocabulary) and R is a possible relationship between them. Each S and T receives a vector representation, while R is an affine transformation of the vector space. True triples, corresponding to statements "S is related to T through R", have low energy, meaning the cosine distance between the T vector and the R-transformed S vector must be small. We relate this energy to a probability using a Boltzmann distribution, and learn the set of parameters (all vectors and transformation matrices) using stochastic maximum likelihood. We approximate gradients of the partition function using persistent contrastive divergence, obtaining model samples with Gibbs sampling from the conditional distributions for each element of the triple. Our model can further handle missing labels, such as statements of the form "S is related to T through some unknown relationship", by averaging gradients weighted by the conditional probability of each possible label.
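To make the energy function concrete, here is a minimal numpy sketch; the dimensions, initialisation, and function names are our own choices for illustration, not the reference implementation. Each relationship is an affine map of the space, the energy of a triple is the negative cosine similarity between the R-transformed S vector and the T vector, and a Boltzmann distribution turns energies into probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_words, n_relations = 4, 10, 3  # tiny, arbitrary sizes

# Parameters: a vector per token, and an affine map (matrix plus offset)
# per relationship.
word_vectors = rng.normal(size=(n_words, dim))
relation_mats = rng.normal(size=(n_relations, dim, dim))
relation_offsets = rng.normal(size=(n_relations, dim))

def energy(s, r, t):
    """E(S, R, T): negative cosine similarity between the R-transformed
    S vector and the T vector, so true triples should have low energy."""
    transformed = relation_mats[r] @ word_vectors[s] + relation_offsets[r]
    target = word_vectors[t]
    return -transformed @ target / (
        np.linalg.norm(transformed) * np.linalg.norm(target))

def p_relation_given_pair(s, t):
    """Boltzmann conditional P(R | S, T), proportional to exp(-E(S, R, T)).
    Conditionals like this drive the Gibbs sampling step, and weight the
    gradients when the relationship label is missing."""
    energies = np.array([energy(s, r, t) for r in range(n_relations)])
    weights = np.exp(-energies)
    return weights / weights.sum()

print(p_relation_given_pair(s=0, t=1))  # a distribution over the 3 relations
```

A conditional like P(R | S, T) is also what lets us, in the spirit of the gleevec example above, rank plausible relationships for a word that only ever appeared in sentences.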
For more details, please consult the paper!
The implementation is in Python; please direct questions or issues to @corcra!