FastText Tutorial – We shall learn how to train a model to learn Word Representations in FastText, by training word vectors using Unsupervised Learning techniques.
Learn Word Representations in FastText
For training with machine learning, words and sentences can be represented in an efficient numerical form called Word Vectors. FastText provides tools to learn these word representations, which can boost accuracy for tasks such as text classification.
Input Data
Unlike supervised learning, unsupervised learning does not require labelled data, so any large text dump can be used as input to train a model for learning word representations. For example, if you want to train your model on a huge corpus, you can find many Wikipedia dumps at https://dumps.wikimedia.org/enwiki/latest/.
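If you wish to try one of those dumps, you could fetch it from the command line. The following is a minimal sketch, assuming the standard latest-articles dump; note that the archive must be decompressed and the wiki markup stripped (for example, with the wikifil.pl preprocessing script shipped with the fastText source) before the text is usable as training data:
$ # download the latest English Wikipedia articles dump (tens of gigabytes)
$ wget -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ # decompress while keeping the archive; the result is raw XML that still needs markup cleaning
$ bzip2 -dk enwiki-latest-pages-articles.xml.bz2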
For this tutorial, we shall use the sample data shown below:
Text Classification is one of the important NLP (Natural Language Processing) task with wide range of application in solving problems like Document Classification, Sentiment Analysis, Email SPAM Classification, Tweet Classification etc.
FastText provides “supervised” module to build a model for Text Classification using Supervised learning.
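To follow along, save this sample text to a file named wordRepTrainingData.txt, which the training commands below expect. One way to create the file from the shell (the heredoc simply writes the two sample sentences verbatim):
$ cat > wordRepTrainingData.txt << 'EOF'
Text Classification is one of the important NLP (Natural Language Processing) task with wide range of application in solving problems like Document Classification, Sentiment Analysis, Email SPAM Classification, Tweet Classification etc.
FastText provides "supervised" module to build a model for Text Classification using Supervised learning.
EOF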
To work with fastText, it has to be built from source. To build fastText, follow the fastText Tutorial – How to build FastText library from github source. Once fastText is built, run the fasttext commands mentioned in the rest of this tutorial from the location of the fasttext executable.
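In short, the build steps look like the following (a sketch assuming a Unix-like system with git and make available; see the linked tutorial for details):
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make
After a successful build, the fasttext executable is available in the fastText directory.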
But please remember that, for any useful model to be trained, you may need a large corpus relevant to your use case, possibly on the order of a billion words.
Train a Model to Learn Word Representations
To train word vectors, FastText provides two techniques:
- Continuous Bag Of Words (CBOW)
- SkipGram
CBOW learns to predict a target word from the context words around it, while SkipGram learns to predict the context words from the target word.
Training Continuous Bag Of Words (CBOW) Model
Following is the syntax to train word vectors using the CBOW model.
$ ./fasttext cbow -input <input_file> -output <output_file>
Example
We shall use the data in the text file provided in the Input Data section as training data.
$ ./fasttext cbow -input wordRepTrainingData.txt -output cbowModel
Read 0M words
Number of words: 2
Number of labels: 0
Progress: 100.0% words/sec/thread: 33 lr: 0.000000 loss: 0.000000 eta: 0h0m
cbowModel.bin and cbowModel.vec are created after training.
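The .bin file is the full binary model, while the .vec file is plain text and can be inspected directly; its first line holds the vocabulary size and vector dimension, followed by one word per line with its vector. A quick way to peek at it:
$ head -n 2 cbowModel.vec
With this tiny sample corpus and the default -minCount of 5, only tokens occurring at least five times make it into the vocabulary, which is why the training log above reports just 2 words.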
Training SkipGram Model
Following is the syntax to train word vectors using the SkipGram model.
$ ./fasttext skipgram -input <input_file> -output <output_file>
Example
We shall use the data in the text file provided in the Input Data section as training data.
$ ./fasttext skipgram -input wordRepTrainingData.txt -output skipGramModel
Read 0M words
Number of words: 2
Number of labels: 0
Progress: 100.0% words/sec/thread: 27 lr: 0.000000 loss: 0.000000 eta: 0h0m
skipGramModel.bin and skipGramModel.vec are created after training.
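Both cbow and skipgram accept the same training options, so you can also set hyperparameters explicitly. Below is an illustrative sketch with commonly tuned flags (the values are arbitrary examples, not recommendations): -dim sets the vector size, -epoch the number of passes over the data, -lr the learning rate, -minCount the minimum word frequency, and -minn/-maxn the range of character n-gram lengths used for subword information.
$ ./fasttext skipgram -input wordRepTrainingData.txt -output skipGramModel -dim 100 -epoch 25 -lr 0.05 -minCount 1 -minn 3 -maxn 6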
Print Word Vectors
Once the model is generated, we shall have a look at how to calculate word vectors for some input words:
Example: Calculate the word vector for the word “Classification”
$ echo "Classification" | ./fasttext print-word-vectors cbowModel.bin
Classification -0.0016351 -0.00038951 -0.00069403 -0.00055687 4.6813e-05 0.00084484 -0.00032377 -0.0014186 -0.00010761 0.00096472 0.00041914 0.0018084 -0.00021441 0.0016066 -0.00025791 -0.00013698 0.0015549 0.00080067 -0.0011226 -0.0001057 0.00077716 3.0814e-05 -0.0008903 0.00051218 0.0010777 -0.00021787 0.0004454 -9.1978e-05 0.0013804 -0.00065836 -0.00012421 0.00090651 -0.00076955 0.00015702 -6.6829e-05 0.00037686 -0.00082451 -0.00089599 -4.8236e-05 0.0011861 -0.00053301 0.0013759 -0.00050949 -0.00052694 -0.00025271 0.00018434 0.00069015 0.00022772 -0.0006613 -0.00024038 0.00082301 -0.001342 -0.00023147 4.6686e-05 -0.0021591 -0.0012267 0.00016453 -7.0963e-05 0.00012941 -0.00033523 -0.00025687 -0.0016622 0.0011311 0.00031574 0.00051476 0.00021078 -0.0010296 -0.00077612 -0.0002647 0.00040547 0.00022524 7.8208e-06 -0.0012234 -0.0012435 0.00084114 -0.0021134 -0.00032346 -0.00037915 -0.0011645 -0.00055294 0.000298 0.00022919 -0.00040574 0.0010034 0.00027639 0.00071129 -0.00096475 -0.00088694 -0.00020765 0.00017506 -0.00074152 -0.00063677 -0.0018727 -0.00081131 -0.00027694 0.00061828 -0.00024931 -0.0011524 0.00021265 -0.00024279
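Because fastText composes word vectors from character n-gram (subword) vectors, it can also produce a vector for a word that never appeared in the training data. For example (the query word below is just an illustration; any out-of-vocabulary word would work):
$ echo "Classifications" | ./fasttext print-word-vectors cbowModel.bin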
Print Sentence Vectors
We can also calculate sentence vectors using the CBOW and SkipGram models that we generated.
Example: Calculate the sentence vector for the sentence “Text Classification”
$ echo "Text Classification" | ./fasttext print-sentence-vectors cbowModel.bin
Text Classification -0.10849 0.0073465 0.010102 -0.063361 -0.059639 0.056901 -0.06169 -0.04626 0.015623 0.079396 0.063662 0.13331 -0.10584 0.1265 -0.070325 -0.094202 0.082804 0.066358 -0.033852 0.039573 0.0044317 -0.042774 -0.14243 0.010955 0.053763 0.011553 0.072239 -0.10154 0.007844 -0.028087 -0.057292 0.016036 -0.11378 0.026555 -0.043418 -0.00021922 0.053161 -0.024643 0.044737 0.11826 -0.086438 0.062033 0.0086412 -0.064439 0.044403 -0.030381 0.073831 0.0065884 -0.14511 0.049224 0.1389 -0.0043203 0.05156 0.028902 -0.15638 -0.11769 0.01515 0.050197 0.025984 -0.030021 -0.028685 -0.12303 0.0008013 0.084163 0.025181 0.016443 -0.08329 -0.0037237 -0.016232 0.044954 -0.0032083 0.008169 -0.10068 -0.12146 -0.013546 -0.27842 -0.042486 -0.088876 -0.084226 -0.0492 0.096401 0.01784 -0.028391 0.019633 0.09417 0.10986 -0.055056 -0.051792 -0.11848 0.025789 -0.013399 -0.12246 -0.11678 -0.018821 0.07682 0.007471 0.015359 -0.003884 -0.02354 -0.0035358
We have printed word and sentence vectors using the CBOW model. You may try the SkipGram model as practice; all you need to do is provide skipGramModel.bin instead of cbowModel.bin in the commands.
Conclusion
In this FastText Tutorial, we have learnt how to make a model learn Word Representations in FastText using the Unsupervised Learning techniques CBOW (Continuous Bag of Words) and SkipGram, and how to calculate vectors for both words and sentences.