Originally published on the blog of a former employer, Idibon (long since closed), on 8th September 2015.
The key to language understanding for humans is recognizing patterns: if you find out that ‘twerk’ means a type of dancing, you automatically know that a ‘twerker’ is someone who ‘twerks’. You will also expect ‘twerking’ to occur on ‘dance floors’ and in ‘clubs’, but not in ‘offices’ (there are exceptions). The contexts in which words occur allow us to build a rich map of the relationships between words.
Computational linguists are teaching computers to do exactly this: create ‘maps’ of relationships between words by looking at how they are used in similar contexts. The programs can notice that ‘man’ and ‘woman’ occur in almost identical contexts, except that ‘man’ occurs near the pronouns ‘he’ and ‘him’ and ‘woman’ occurs near the pronouns ‘she’ and ‘her’. So, on a map, if ‘man’ is southwest of ‘woman’, this represents the gender difference, and we would expect ‘king’ to be southwest of ‘queen’ by the same distance. These kinds of relationships have been researched since at least as far back as Rohde et al. in 2005:
Maps of word relationships, from Douglas Rohde, Laura Gonnerman, and David
Plaut's "An Improved Model of Semantic Similarity Based on Lexical Co-
Occurence." (2005)
Recently, the field has attracted the most attention through deep learning. Stanford professor and Idibon advisor Chris Manning recently gave a talk at Idibon on a deep learning method called GloVe (Global Vectors for Word Representation), taking a deep dive into the world of compositional deep learning and recent comparative results for GloVe and other similar efforts.
The ‘map’ analogy still applies, although in practice it is a little more complex: there are not just two dimensions (East–West and North–South) but potentially thousands, depending on the size of the vectors. Because of this, researchers tend to measure closeness in slightly more complicated ways than ‘straight line’ distance – typically by the angle between vectors (cosine similarity) – because straight-line distance is not reliable in high-dimensional space. Despite this, it is still relatively easy to visualize the relationships:
Vectors for gender relationships (left) and adding an extra dimension for
plural relationships (right) from Tomas Mikolov, Wen-tau Yih and Geoffrey Zweig's
"Linguistic Regularities in Continuous Space Word Representations" (2013)
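As a rough illustration of how closeness is measured by angle rather than straight-line distance, here is a minimal sketch using cosine similarity. The toy 4-dimensional vectors are invented purely for illustration; real embeddings have hundreds or thousands of dimensions learned from text:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "word vectors" (made-up numbers, for illustration only).
man   = np.array([0.5, 0.1, 0.9, 0.2])
woman = np.array([0.5, 0.8, 0.9, 0.2])
king  = np.array([0.9, 0.1, 0.3, 0.7])

# Words used in similar contexts score closer to 1.
print(cosine_similarity(man, woman))
print(cosine_similarity(man, king))
```

Because cosine similarity ignores vector length and looks only at direction, it behaves more consistently than Euclidean distance as the number of dimensions grows.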
When Mikolov, Yih, and Zweig presented their paper showing that the dimensions of the vector space also represent dimensions of meaning, picking up semantic attributes of a word, it became possible for these programs to make analogies in meaning as humans do:
man : woman :: king : X
A human who speaks English can probably guess that X should be ‘queen’, and the program developed by Mikolov, Yih, and Zweig was able to do the same.
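The offset arithmetic behind such analogies can be sketched with toy vectors. Everything here is invented for illustration (a real system would use trained embeddings over a large vocabulary), but the mechanism is the same: compute b − a + c and return the nearest remaining word by cosine similarity:

```python
import numpy as np

# Toy vectors with a consistent "gender" offset in the second dimension
# (made-up numbers; real embeddings have hundreds of dimensions).
vectors = {
    "man":   np.array([0.2, 0.1, 0.8]),
    "woman": np.array([0.2, 0.9, 0.8]),
    "king":  np.array([0.9, 0.1, 0.3]),
    "queen": np.array([0.9, 0.9, 0.3]),
}

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by vector offset, excluding the query words."""
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_sim = None, -2.0
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(analogy("man", "woman", "king", vectors))  # "queen"
```

Excluding the three query words from the search matters in practice: the nearest neighbour of b − a + c is very often c itself.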
These systems use distributed word representations (as opposed to atomic models), and they come in two families: count models and predict models. Latent Semantic Analysis (LSA) is a count model; predict models include word2vec’s continuous bag of words (CBOW) and skip-gram architectures. Baroni and colleagues compared the two families in their 2014 paper: while count models such as LSA train quickly and make efficient use of statistics, predict models proved far better at capturing complex patterns of word similarity.
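A minimal sketch of the count-model idea, in the spirit of LSA: count how often words co-occur, then compress the co-occurrence matrix into dense low-dimensional vectors with SVD. The four-sentence corpus is invented for illustration:

```python
import numpy as np

corpus = [
    "the king rules the land",
    "the queen rules the land",
    "the man walks the dog",
    "the woman walks the dog",
]

# Build a word-by-word co-occurrence matrix within each sentence.
vocab = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))
for line in corpus:
    words = line.split()
    for i, w in enumerate(words):
        for j, c in enumerate(words):
            if i != j:
                counts[index[w], index[c]] += 1

# Truncated SVD compresses the sparse counts into dense vectors, as LSA does.
u, s, vt = np.linalg.svd(counts)
embeddings = u[:, :2] * s[:2]  # 2-dimensional word vectors
print(embeddings[index["king"]])
```

Even on this tiny corpus, ‘king’ and ‘queen’ end up with identical vectors because they occur in identical contexts, which is the intuition count models exploit at scale.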
The GloVe Model
Ideally, one would combine these models to produce even better results. Pennington, Socher, and Manning (EMNLP 2014) noticed that by taking the ratio of co-occurrence probabilities, one can encode meaning components. With clever scaling (as done in COALS; Rohde, Gonnerman & Plaut, 2005), one can also handle items that are too infrequent or too frequent. The following shows the six words most closely related to ‘frog’:
Above, left-to-right, top-to-bottom: 1. toad, 2. litoria, 3. leptodactylidae, 4. rana, 5. lizard, 6. eleutherodactylus
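The co-occurrence-ratio idea can be sketched as follows. The counts below are invented, but they follow the motivating ice/steam example from the Pennington, Socher & Manning paper: the ratio P(context | ice) / P(context | steam) is large for contexts specific to ice (‘solid’), small for contexts specific to steam (‘gas’), and near 1 for contexts that are shared (‘water’) or irrelevant (‘fashion’):

```python
import numpy as np

# Invented co-occurrence counts for two target words over four contexts.
contexts = ["solid", "gas", "water", "fashion"]
cooc = {
    "ice":   np.array([80.0, 2.0, 300.0, 1.0]),
    "steam": np.array([3.0, 70.0, 290.0, 1.0]),
}

# Co-occurrence probabilities P(context | word)...
p_ice = cooc["ice"] / cooc["ice"].sum()
p_steam = cooc["steam"] / cooc["steam"].sum()

# ...and their ratio, which isolates the meaning component that
# distinguishes the two words while cancelling out shared noise.
ratios = p_ice / p_steam
for ctx, ratio in zip(contexts, ratios):
    print(f"{ctx}: {ratio:.2f}")
```

It is these ratios, rather than the raw probabilities, that GloVe's training objective is built to reproduce.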
Mikolov’s original analogy results worked well primarily on mapping countries to capitals and countries to currencies. Taking the first choice, Mikolov’s 2012 and 2013 word analogies yielded 36.1% accuracy with continuous bag of words; on a first pass, GloVe produced 70.3% accuracy. After word2vec and GloVe were both retrained on a corpus of 6 billion tokens (word2vec was originally trained on 1.4 billion tokens and GloVe on 1.6 billion), word2vec yielded 65.7% accuracy and GloVe 71.7%.
These results are very promising for the use of word analogies in text analytics. Combined with other features or approaches, such tools could give data scientists more accurate and meaningful models for sentiment, fraud, and other common natural language processing use cases. All of this suggests that the next few years of development in computational linguistics will continue to produce novel and potentially vastly improved results.