FastText Tutorial – We shall learn how to train a model to learn Word Representations in FastText, by training word vectors using Unsupervised Learning techniques.

Learn Word Representations in FastText

For training with machine learning, words and sentences are represented in a numerical and efficient form called Word Vectors. FastText provides tools to learn these word representations, which can boost accuracy for tasks such as Text Classification.

Input Data

Unlike supervised learning, unsupervised learning does not require labelled data. So, any text dump can be used as input data to train the model for learning word representations. For example, you may find many dumps from Wikipedia at https://dumps.wikimedia.org/enwiki/latest/, if you want to train your model on a huge corpus.
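
As a minimal sketch, assuming wget, bzip2 and perl are available (the dump is raw XML that needs cleaning before training; wikifil.pl is a cleaning script shipped in the fastText repository, and wikiTrainingData.txt is just an illustrative file name):

$ # download the latest English Wikipedia articles dump (tens of GB)
$ wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ # decompress, then strip the XML markup into plain text (output name is illustrative)
$ bzip2 -d enwiki-latest-pages-articles.xml.bz2
$ perl wikifil.pl enwiki-latest-pages-articles.xml > wikiTrainingData.txt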

For this tutorial, we shall use the sample data shown below:

Text Classification is one of the important NLP (Natural Language Processing) task with wide range of application in solving problems like Document Classification, Sentiment Analysis, Email SPAM Classification, Tweet Classification etc.

FastText provides “supervised” module to build a model for Text Classification using Supervised learning.
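
For instance, you could save this sample text to wordRepTrainingData.txt, the file name used by the training commands below:

$ cat > wordRepTrainingData.txt << 'EOF'
Text Classification is one of the important NLP (Natural Language Processing) task with wide range of application in solving problems like Document Classification, Sentiment Analysis, Email SPAM Classification, Tweet Classification etc.
FastText provides “supervised” module to build a model for Text Classification using Supervised learning.
EOF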

To work with fastText, it has to be built from source. To build fastText, follow the FastText Tutorial – How to build FastText library from GitHub source. Once fastText is built, run the fasttext commands mentioned in this tutorial from the location of the fasttext executable.

But please remember that, for any useful model to be trained, you may need a large corpus relevant to your use case, of at least a billion words.

Train model to learn Word Representations

To train word vectors, FastText provides two techniques:

  • Continuous Bag Of Words (CBOW) – the model learns to predict a word from its surrounding context words.
  • SkipGram – the model learns to predict the surrounding context words from a given word.

Training Continuous Bag Of Words (CBOW) Model

Following is the syntax to train word vectors using CBOW model.

$ ./fasttext cbow -input <input_file> -output <output_file>
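
The command also accepts optional hyperparameters; the following is a hedged sketch using standard fastText options with illustrative (not tuned) values:

$ # -dim: vector size, -epoch: training passes, -lr: learning rate,
$ # -minCount: minimal number of word occurrences for a word to be kept
$ ./fasttext cbow -input wordRepTrainingData.txt -output cbowModel -dim 100 -epoch 5 -lr 0.05 -minCount 1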

Example

We shall use the data in the text file provided in the Input Data section as training data.

$ ./fasttext cbow -input wordRepTrainingData.txt -output cbowModel
Read 0M words
Number of words:  2
Number of labels: 0
Progress: 100.0%  words/sec/thread: 33  lr: 0.000000  loss: 0.000000  eta: 0h0m

cbowModel.bin and cbowModel.vec are created after training.
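
cbowModel.vec is a plain-text file with one vector per line, which you can inspect directly. As an aside, the log above reports only 2 words, most likely because, with the default -minCount of 5, the rarer words in our tiny sample are dropped from the vocabulary:

$ # the first line holds vocabulary size and vector dimension
$ head -n 3 cbowModel.vec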

Training SkipGram Model

Following is the syntax to train word vectors using the SkipGram model.

$ ./fasttext skipgram -input <input_file> -output <output_file>
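
SkipGram also learns subword information through character n-grams, which can be tuned with optional flags; a hedged sketch using standard fastText options (the values shown are the documented defaults):

$ # -minn/-maxn: min and max character n-gram length, -ws: context window size
$ ./fasttext skipgram -input wordRepTrainingData.txt -output skipGramModel -minn 3 -maxn 6 -ws 5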

Example

We shall use the data in the text file provided in the Input Data section as training data.

$ ./fasttext skipgram -input wordRepTrainingData.txt -output skipGramModel
Read 0M words
Number of words: 2
Number of labels: 0
Progress: 100.0% words/sec/thread: 27 lr: 0.000000 loss: 0.000000 eta: 0h0m

skipGramModel.bin and skipGramModel.vec are created after training.

Print Word Vectors

Once the model is generated, we shall have a look at how to calculate word vectors for some input words:

Example : Calculate word vector for the word “Classification”

$ echo "Classification" | ./fasttext print-word-vectors cbowModel.bin 
Classification -0.0016351 -0.00038951 -0.00069403 -0.00055687 4.6813e-05 0.00084484 -0.00032377 -0.0014186 -0.00010761 0.00096472 0.00041914 0.0018084 -0.00021441 0.0016066 -0.00025791 -0.00013698 0.0015549 0.00080067 -0.0011226 -0.0001057 0.00077716 3.0814e-05 -0.0008903 0.00051218 0.0010777 -0.00021787 0.0004454 -9.1978e-05 0.0013804 -0.00065836 -0.00012421 0.00090651 -0.00076955 0.00015702 -6.6829e-05 0.00037686 -0.00082451 -0.00089599 -4.8236e-05 0.0011861 -0.00053301 0.0013759 -0.00050949 -0.00052694 -0.00025271 0.00018434 0.00069015 0.00022772 -0.0006613 -0.00024038 0.00082301 -0.001342 -0.00023147 4.6686e-05 -0.0021591 -0.0012267 0.00016453 -7.0963e-05 0.00012941 -0.00033523 -0.00025687 -0.0016622 0.0011311 0.00031574 0.00051476 0.00021078 -0.0010296 -0.00077612 -0.0002647 0.00040547 0.00022524 7.8208e-06 -0.0012234 -0.0012435 0.00084114 -0.0021134 -0.00032346 -0.00037915 -0.0011645 -0.00055294 0.000298 0.00022919 -0.00040574 0.0010034 0.00027639 0.00071129 -0.00096475 -0.00088694 -0.00020765 0.00017506 -0.00074152 -0.00063677 -0.0018727 -0.00081131 -0.00027694 0.00061828 -0.00024931 -0.0011524 0.00021265 -0.00024279
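
Because fastText builds word vectors from character n-grams, it can also produce a vector for a word that was never seen during training. For example, “Classifications” does not occur in our sample data, yet:

$ echo "Classifications" | ./fasttext print-word-vectors cbowModel.bin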

Print Sentence Vectors

We could also calculate sentence vectors using the CBOW and SkipGram models that we generated.

Example: Calculate sentence vector for the sentence “Text Classification”

$ echo "Text Classification" | ./fasttext print-sentence-vectors cbowModel.bin 
Text Classification -0.10849 0.0073465 0.010102 -0.063361 -0.059639 0.056901 -0.06169 -0.04626 0.015623 0.079396 0.063662 0.13331 -0.10584 0.1265 -0.070325 -0.094202 0.082804 0.066358 -0.033852 0.039573 0.0044317 -0.042774 -0.14243 0.010955 0.053763 0.011553 0.072239 -0.10154 0.007844 -0.028087 -0.057292 0.016036 -0.11378 0.026555 -0.043418 -0.00021922 0.053161 -0.024643 0.044737 0.11826 -0.086438 0.062033 0.0086412 -0.064439 0.044403 -0.030381 0.073831 0.0065884 -0.14511 0.049224 0.1389 -0.0043203 0.05156 0.028902 -0.15638 -0.11769 0.01515 0.050197 0.025984 -0.030021 -0.028685 -0.12303 0.0008013 0.084163 0.025181 0.016443 -0.08329 -0.0037237 -0.016232 0.044954 -0.0032083 0.008169 -0.10068 -0.12146 -0.013546 -0.27842 -0.042486 -0.088876 -0.084226 -0.0492 0.096401 0.01784 -0.028391 0.019633 0.09417 0.10986 -0.055056 -0.051792 -0.11848 0.025789 -0.013399 -0.12246 -0.11678 -0.018821 0.07682 0.007471 0.015359 -0.003884 -0.02354 -0.0035358

We have printed word and sentence vectors using the CBOW model. You may try the same with the SkipGram model as practice. All you need to do is provide skipGramModel.bin instead of cbowModel.bin in the commands, as shown below.
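
For example, to print the sentence vector using the SkipGram model:

$ echo "Text Classification" | ./fasttext print-sentence-vectors skipGramModel.bin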

Conclusion

In this FastText Tutorial, we have learnt to train a model to learn Word Representations in FastText using Unsupervised Learning techniques – CBOW (Continuous Bag Of Words) and SkipGram – and to calculate word vectors for words and sentences.