fasttext

ben.wangz

Introduction

FastText is a library for efficient learning of word representations and sentence classification.
It lets users learn text representations and train text classifiers.
It is written in C++ and supports multiprocessing during training.
A popular idea in modern machine learning is to represent words by vectors. These vectors capture hidden information about a language, such as word analogies or semantics. They are also used to improve the performance of text classifiers.
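
To make the idea of word analogies concrete, here is a minimal sketch in plain Python. The three-dimensional vectors in `toy_vectors` are made up for illustration and do not come from any fastText model; `cosine` and `analogy` are hypothetical helper names. The sketch only shows the usual vector-arithmetic reading of "man is to king as woman is to ?".

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hypothetical toy vectors, invented for this example only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.2, 0.5, 0.1],
}

print(analogy("man", "king", "woman", toy_vectors))
```

With real fastText vectors the same arithmetic would run over a few hundred dimensions and a large vocabulary, so the nearest word is not guaranteed to be the expected one.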

Prepare

Prepare the pre-trained model

curl -LO https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.zip
unzip wiki.en.zip
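
Only `wiki.en.bin` is used in the steps that follow. As a possible alternative to `unzip`, the sketch below extracts just the `.bin` members of the archive with Python's standard `zipfile` module. It assumes the archive holds `wiki.en.bin` next to a text-format vectors file; `extract_bin` is a hypothetical helper name, not part of fastText.

```python
import zipfile
from pathlib import Path


def extract_bin(zip_path, dest_dir="."):
    """Extract only the .bin members of a zip archive and return their paths."""
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".bin"):
                # ZipFile.extract returns the path of the file it created.
                extracted.append(Path(archive.extract(name, dest_dir)))
    return extracted


if __name__ == "__main__":
    archive_path = Path("wiki.en.zip")
    # Do nothing when the archive has not been downloaded yet.
    if archive_path.exists():
        print(extract_bin(archive_path))
```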

Prepare the Python script vectorization.py

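import os

import fasttext

# Both settings can be overridden through environment variables.
model_path = os.getenv("MODEL_PATH", default="/app/model/wiki.en.bin")
sentence = os.getenv("SENTENCE", default="hello world")

# Load the pre-trained binary model (large, so this step takes a while).
model = fasttext.load_model(model_path)
# get_sentence_vector expects a single line of text without newline characters.
vector = model.get_sentence_vector(sentence)
print(vector)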
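
For intuition about what `get_sentence_vector` returns, here is a rough sketch in plain Python. As far as I understand fastText's behavior for unsupervised models such as this one, each word vector is divided by its L2 norm and the results are averaged. The helper `sentence_vector` and the two-dimensional `toy_word_vectors` are hypothetical and only illustrate that arithmetic; they are not fastText's implementation.

```python
import math


def sentence_vector(sentence, word_vectors):
    """Average the L2-normalized vectors of the whitespace-separated tokens."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    count = 0
    for token in sentence.split():
        vec = word_vectors.get(token)
        if vec is None:
            # Simplification: unknown words are skipped here, whereas a real
            # fastText model builds a vector for them from character n-grams.
            continue
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            total = [t + x / norm for t, x in zip(total, vec)]
            count += 1
    return [t / count for t in total] if count else total


# Hypothetical toy vectors, invented for this example only.
toy_word_vectors = {"hello": [3.0, 4.0], "world": [0.0, 2.0]}
print(sentence_vector("hello world", toy_word_vectors))
```

Because every word is normalized before averaging, long and short word vectors contribute equally to the sentence vector.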

Run with a container

# NOTE: loading wiki.en.bin needs more than 16 GB of memory
podman run --rm \
    -v "$(pwd)/wiki.en.bin:/app/model/wiki.en.bin" \
    -v "$(pwd)/vectorization.py:/app/vectorization.py" \
    -e MODEL_PATH=/app/model/wiki.en.bin \
    -e SENTENCE="On a freezing New Year's Eve, a poor young girl, shivering, bareheaded and barefoot, unsuccessfully tries to sell matches in the street." \
    -it docker.io/library/python:3.12.1-bullseye \
        bash -c "pip install -i https://mirrors.aliyun.com/pypi/simple/ fasttext==0.9.2 && python3 /app/vectorization.py"

Reference

https://github.com/facebookresearch/fastText