In this tutorial, we learn how to train word representations (word vectors) with FastText in Python using unsupervised learning techniques.

Learn Word Representations in FastText

To train machine learning models on text, words and sentences have to be represented numerically. Word vectors are an efficient representation of this kind: each word is mapped to a vector of numbers. FastText provides tools to learn these word representations, which can improve accuracy on tasks such as text classification.

1. Install FastText in Python

Cython is a prerequisite for installing fasttext. To install Cython, run the following command in a terminal:

$ pip install Cython --install-option="--no-cython-compile"

To use fasttext in a Python program, install it using the following command:

$ pip install fasttext
root@arjun-VPCEH26EN:~# pip install fasttext
Collecting fasttext
  Using cached fasttext-0.8.3.tar.gz
Collecting numpy>=1 (from fasttext)
  Downloading numpy-1.13.1-cp27-cp27mu-manylinux1_x86_64.whl (16.6MB)
    100% |████████████████████████████████| 16.6MB 48kB/s
Collecting future (from fasttext)
  Downloading future-0.16.0.tar.gz (824kB)
    100% |████████████████████████████████| 829kB 228kB/s
Building wheels for collected packages: fasttext, future
  Running setup.py bdist_wheel for fasttext ... done
  Stored in directory: /root/.cache/pip/wheels/55/0a/95/e23f773666d3487ee7456b220f7e8d37e99b74833b20dd06a0
  Running setup.py bdist_wheel for future ... done
  Stored in directory: /root/.cache/pip/wheels/c2/50/7c/0d83b4baac4f63ff7a765bd16390d2ab43c93587fac9d6017a
Successfully built fasttext future
Installing collected packages: numpy, future, fasttext
Successfully installed fasttext-0.8.3 future-0.16.0 numpy-1.13.1
root@arjun-VPCEH26EN:~# 

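FastText is successfully installed in Python. The examples in this tutorial use the API of version 0.8.3, as shown in the output above. In newer releases of the package, fasttext.cbow and fasttext.skipgram are no longer supported, and fasttext.train_unsupervised is used to train word vectors instead.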

2. Input Data

The input is given as a plain text file. Keep in mind that training a useful model requires a large corpus that is relevant to your use case, ideally on the order of a billion words.
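
As an illustration of preparing such a text file, the following is a minimal sketch, assuming the raw text is available as a list of strings and that lowercasing and separating punctuation is the desired clean-up. The helper names normalize_line and write_corpus are made up for this example and are not part of fasttext. The sketch is written for Python 3.

```python
import re


def normalize_line(line):
    """Lowercase a line and separate punctuation from the words it touches."""
    line = line.lower()
    # Put spaces around common punctuation marks so they become their own tokens.
    line = re.sub(r"([.,!?;:()\"])", r" \1 ", line)
    # Collapse runs of whitespace into single spaces and trim the ends.
    return " ".join(line.split())


def write_corpus(lines, path):
    """Write the normalized, non-empty lines to a UTF-8 text file, one per line.

    Returns the number of lines written.
    """
    written = 0
    with open(path, "w", encoding="utf-8") as out:
        for line in lines:
            cleaned = normalize_line(line)
            if cleaned:
                out.write(cleaned + "\n")
                written += 1
    return written


# Hypothetical sample input standing in for a real corpus.
sample = ["Machine learning is fun!", "", "FastText learns word vectors."]
print([normalize_line(s) for s in sample])
```

Calling write_corpus(sample, 'TrainingData.txt') would then produce a file that can be passed to the training functions in the next sections. How much clean-up is appropriate depends on the use case, so treat these choices as a starting point.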

3. Train model to Learn Word Representations

To train word vectors, FastText provides two techniques:

  • Continuous Bag Of Words (CBOW)
  • SkipGram
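
The difference between the two techniques lies in what is predicted from what: CBOW predicts a word from its surrounding context, while SkipGram predicts the surrounding context words from a word. The following sketch only illustrates how the training examples differ. It is a simplification and not how fasttext is implemented internally, and the function names are made up for this example.

```python
def context_window(tokens, index, window):
    """Return the tokens within `window` positions of tokens[index], excluding it."""
    start = max(0, index - window)
    left = tokens[start:index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: the whole context is the input, the center word is the target."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """SkipGram: the center word is the input, each context word is a separate target."""
    return [(tokens[i], context_word)
            for i in range(len(tokens))
            for context_word in context_window(tokens, i, window)]


tokens = "the quick brown fox".split()
for example in cbow_examples(tokens, window=1):
    print("CBOW    ", example)
for example in skipgram_examples(tokens, window=1):
    print("SkipGram", example)
```

With a window of 1, the word 'quick' yields one CBOW example, (['the', 'brown'], 'quick'), but two SkipGram examples, ('quick', 'the') and ('quick', 'brown').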

4. Train a CBOW model

The following example builds a CBOW model.

import fasttext

# CBOW model: train on TrainingData.txt and save the output as model.bin and model.vec
model = fasttext.cbow('TrainingData.txt', 'model')
print(model.words)  # list of words in dictionary

print(model['machine'])  # get the vector of the word 'machine'

Running the above Python program creates two files: a model file (with .bin extension) containing the trained parameters, and a vector file (with .vec extension) containing the vector representations of the words in the training data file.

5. Train a SkipGram model

The following example builds a SkipGram model.

import fasttext

# SkipGram model: train on data.txt and save the output as model.bin and model.vec
model = fasttext.skipgram('data.txt', 'model')
print(model.words)  # list of words in dictionary

print(model['machine'])  # get the vector of the word 'machine'

As with the CBOW model, running the above Python program creates two files: a model file (with .bin extension) containing the trained parameters, and a vector file (with .vec extension) containing the vector representations of the words in the training data file.
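
The .vec file is plain text, so it can be inspected without fasttext. The following is a minimal sketch, assuming the usual layout of such files: a header line with the number of words and the vector dimension, followed by one line per word with the word and its vector values separated by spaces. The function read_vec and the sample data are made up for this example.

```python
import io


def read_vec(stream):
    """Parse .vec-style text: a 'count dim' header, then 'word v1 v2 ...' per line."""
    header = stream.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return count, dim, vectors


# Made-up miniature data standing in for a real model.vec file.
sample = io.StringIO("2 3\nmachine 0.1 -0.2 0.3\nlearning 0.0 0.5 -0.1\n")
count, dim, vectors = read_vec(sample)
print(count, dim, vectors["machine"])
```

To read a real file, pass an open file object instead of the sample, for example open('model.vec', encoding='utf-8').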

6. Use a pre-trained model

A trained model (the output of the CBOW or SkipGram training above) can be loaded again later or on another computer. The following example demonstrates the usage.

import fasttext
# load a model that was trained earlier with 'cbowModel' as the output name
model = fasttext.load_model('cbowModel.bin')
print(model['machine'])  # get the vector of the word 'machine'
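
A common use of the vectors obtained this way is to compare two words. The following is a minimal sketch of cosine similarity in plain Python, assuming the vectors are available as lists of numbers. The function name and the sample vectors are made up for this example, and real word vectors have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for a zero vector; return 0.0 by convention here
    return dot / (norm_a * norm_b)


# Made-up three-dimensional vectors standing in for model['machine'] and model['computer'].
machine = [0.2, 0.1, -0.4]
computer = [0.25, 0.05, -0.35]
print(cosine_similarity(machine, computer))
```

Values close to 1 indicate vectors pointing in nearly the same direction, which for word vectors usually means the words appear in similar contexts.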

7. Print all words in the dictionary

To get the list of all words in the model's dictionary, use the words attribute of the loaded model, as the following Python program demonstrates.

import fasttext
# load a model that was trained earlier with 'cbowModel' as the output name
model = fasttext.load_model('cbowModel.bin')
print(model.words)  # list of words in dictionary

Conclusion

In this FastText tutorial, we have learnt how to train models that learn word representations with unsupervised learning techniques (CBOW and SkipGram) using fasttext in Python, and how to load a trained model and read its words and vectors.