This is an implementation of the LCM, but only the transformer-based Base-LCM.
From my understanding, the LCM focuses on sentences rather than tokens. Because it models sequences of sentence embeddings instead of much longer token sequences, it sidesteps the transformer's quadratic scaling with sequence length. Working only at the sentence level also has its limitations, such as not being able to really focus on the entire context, but this still seems useful for Meta's social media apps, where users typically communicate through short texts.
Important
This implementation currently uses the SONAR encoder from Hugging Face. According to the model creator, this encoder is the standard version and does not include custom modifications.
- SONAR Decoder Not Integrated Yet: At this stage, only the encoder is implemented. The decoder will be added in a future update to complete the model.
- Work in Progress: This implementation is still incomplete, and further enhancements are planned.
So the Base-LCM has the following components (a minimal sketch of how they fit together is shown after this list):
- SONAR encoder: a pre-trained model released by Meta that encodes the sentences into embeddings
- Pre-net: normalizes the input SONAR embeddings and maps them to the model's hidden dimension
- Transformer decoder: transduces a sequence of preceding concepts (i.e., sentence embeddings) into a sequence of future ones
- Post-net: maps the hidden representations produced by the transformer decoder back to the original embedding space
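For orientation, here is a minimal PyTorch sketch of how these pieces might fit together. The class and argument names (`BaseLCM`, `hidden_dim`, and so on) are illustrative assumptions and do not necessarily match the code in `src/`; the decoder-only stack is approximated with `nn.TransformerEncoder` plus a causal mask.

```python
import torch
import torch.nn as nn


class BaseLCM(nn.Module):
    """Illustrative Pre-net -> transformer decoder -> Post-net stack."""

    def __init__(self, input_dim=1024, hidden_dim=512, num_heads=8, num_layers=12):
        super().__init__()
        # Pre-net: normalize SONAR embeddings and project them to the hidden dimension
        self.prenet = nn.Sequential(
            nn.LayerNorm(input_dim),
            nn.Linear(input_dim, hidden_dim),
        )
        # Decoder-only transformer: each concept attends only to preceding concepts
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Post-net: map hidden states back to the SONAR embedding space
        self.postnet = nn.Linear(hidden_dim, input_dim)

    def forward(self, concepts: torch.Tensor) -> torch.Tensor:
        # concepts: (batch, seq_len, input_dim) sequence of sentence embeddings
        h = self.prenet(concepts)
        seq_len = h.size(1)
        # Causal mask so position n only sees the preceding concepts
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=h.device), diagonal=1
        )
        h = self.decoder(h, mask=mask)
        return self.postnet(h)


if __name__ == "__main__":
    model = BaseLCM()
    dummy = torch.randn(2, 10, 1024)  # 2 documents, 10 sentence embeddings each
    print(model(dummy).shape)         # torch.Size([2, 10, 1024])
```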
The workflow is as follows (a rough sketch of these steps appears after the list):
- First, we download a dataset from Hugging Face
- We then split those chunks of text into sentences using spaCy
- Then we pass those sentences to the SONAR encoder
- We then add noise to the embeddings produced by the SONAR model
- Finally, we train the model on these noisy concept sequences
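As a rough sketch of these preparation steps, assuming spaCy for sentence splitting and the `sonar` package's text-to-embedding pipeline (the repo itself loads the encoder from Hugging Face, so the exact loading call and the noise level here are assumptions):

```python
import spacy
import torch
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

nlp = spacy.load("en_core_web_sm")  # assumes the small English spaCy model is installed
# Standard SONAR text encoder from the `sonar` package (the repo may instead
# load an equivalent checkpoint from Hugging Face)
encoder = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder"
)


def split_into_sentences(text: str) -> list[str]:
    # One "concept" per sentence, via spaCy's sentence segmentation
    return [sent.text.strip() for sent in nlp(text).sents]


def encode_and_noise(text: str, noise_std: float = 0.1) -> torch.Tensor:
    sentences = split_into_sentences(text)
    # SONAR maps each sentence to a 1024-dimensional embedding
    embeddings = encoder.predict(sentences, source_lang="eng_Latn")
    # Add Gaussian noise before training (noise_std is an illustrative value)
    return embeddings + noise_std * torch.randn_like(embeddings)


# Usage on one document from the dataset (the "text" column name is an assumption):
# ds = datasets.load_dataset("beomi/fineweb-edu-fortified-mini", split="train")
# concepts = encode_and_noise(ds[0]["text"])
```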
The Loss Function in Base-LCM is the Mean Squared Error (MSE), which measures the difference between the predicted next concept embedding and the true next concept embedding. The model is trained to minimize this loss, effectively learning to predict the next concept in a sequence.
The Training Objective is to optimize the model's parameters $\theta$ so that it can accurately predict the next concept $x_n$ from the preceding sequence $x_{<n}$.
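For concreteness, a single training step under this objective might look like the following. This is a sketch that reuses the illustrative `BaseLCM` class from the earlier snippet; the function name and tensor shapes are assumptions, not the repo's actual training loop.

```python
import torch
import torch.nn.functional as F


def training_step(model, concepts, optimizer):
    # concepts: (batch, seq_len, 1024) noisy SONAR embeddings
    inputs, targets = concepts[:, :-1, :], concepts[:, 1:, :]
    preds = model(inputs)              # predicted next concepts
    loss = F.mse_loss(preds, targets)  # MSE between predicted and true next concepts
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```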
To get started, follow these steps:
- Clone the repository and install the dependencies:
git clone https://github.com/dame-cell/BaseLCM.git
cd BaseLCM
pip install -r requirements.txt
- Train the model: you can train it on a smaller version of the FineWeb-Edu dataset for testing purposes. Here's an example command:
python3 src/train.py --hf_data beomi/fineweb-edu-fortified-mini \
--input_dim 1024 \
--output_dim 1024 \
--batch_size 16 \
--epoch 3 \
--num_heads 8 \
--num_layers 12 \
--data_sample 1000
Planned enhancements:
- Make the sentence-encoding process with SONAR faster, possibly by adding multi-GPU support
- Add the SONAR decoder so that generation can be tested
@article{lcm2024,
author = {{LCM team}, Lo\"{i}c Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R. Costa-juss\`{a}, David Dale, Hady Elsahar, Kevin Heffernan, Jo\~{a}o Maria Janeiro, Tuan Tran, Christophe Ropers, Eduardo Sánchez, Robin San Roman, Alexandre Mourachko, Safiyyah Saleem, Holger Schwenk},
title = {{Large Concept Models}: Language Modeling in a Sentence Representation Space},
publisher = {arXiv},
year = {2024},
url = {https://arxiv.org/abs/2412.08821},
}
@misc{Duquenne:2023:sonar_arxiv,
author = {Paul-Ambroise Duquenne and Holger Schwenk and Benoit Sagot},
title = {{SONAR:} Sentence-Level Multimodal and Language-Agnostic Representations},
publisher = {arXiv},
year = {2023},
url = {https://arxiv.org/abs/2308.11466},
}