OpenSeek is dedicated to uniting the global open-source community to drive collaborative innovation in algorithms, data, and systems, with the goal of developing next-generation models that surpass DeepSeek.
English | ็ฎไฝไธญๆ
OpenSeek is an open-source project initiated by the Beijing Academy of Artificial Intelligence (BAAI), aiming to unite the global open-source community to drive collaborative innovation in algorithms, data, and systems and to develop next-generation models that surpass DeepSeek. Drawing inspiration from large-model initiatives such as BigScience and OPT, the project is dedicated to building an independent open-source algorithmic innovation system. Since the open-sourcing of the DeepSeek model, academia has seen numerous algorithmic improvements and breakthroughs, but these innovations often lack complete code implementations, the necessary computational resources, and high-quality data support. By uniting the open-source community, OpenSeek aims to explore mechanisms for constructing high-quality datasets, open-source the entire large-model training pipeline, build innovative training and inference code that supports a variety of AI chips beyond Nvidia, and promote independent technological innovation and application development.
Objectives of OpenSeek:
- Advanced data technology: Address the challenge of acquiring high-quality data.
- Support for multiple AI devices: Reduce dependency on specific chips and improve model universality and adaptability.
- Standardized LLM training baseline: Promote independent algorithmic innovation and technology sharing through open-source collaboration.
Project: https://github.com/orgs/FlagAI-Open/projects/1
Acknowledgments & Contribution Guidelines
Thanks to the FlagScale team for their support of OpenSeek training.
- For system-related improvements: please report framework-specific issues to FlagScale's GitHub Issues; code contributions should be submitted as Pull Requests (PRs) to FlagScale.
- For data & algorithm improvements: dataset implementations, training optimizations, and experimental configurations are discussed here.
For detailed information on how to contribute, please refer to our Contribution Guide. Feel free to contact us. [Discord channel]
- ๐ฅ[05/06/2025] Data group: released the bilingual pretraining dataset CCI4.0-M2-V1 [readme]. Algo group: released the pretrained model OpenSeek-Small V1 [readme][download].
- ๐ฅ[03/20/2025] #4 online meetup 19:00-20:00: [screen recording]
- ๐ฅ[03/20/2025] #3 online meetup 19:00-20:00: [screen recording]
- ๐ฅ[03/06/2025] #2 online meetup 19:00-20:00: [screen recording]
- ๐ฅ[02/25/2025] #1 online meetup 18:00-19:00: [screen recording]
- ๐ฅ[02/13/2025] Completed experiments on the OpenSeek-PT-1T dataset; more.
The openseek-baseline is used as the baseline for the PAZHOU algorithm competition and is also used to evaluate PRs in OpenSeek. It is a standardized LLM training and evaluation pipeline, consisting of a 100B-token dataset, training code, wandb logs, checkpoints, and evaluation results.
- Clone this repository and enter the directory:
git clone https://github.com/FlagAI-Open/OpenSeek.git
cd OpenSeek
- Install the FlagScale dependencies:
- Using Docker (Recommended; a container-start sketch follows this list)
# Pull images
docker pull openseek2025/openseek:flagscale-20250527
# Clone the repository
git clone https://github.com/FlagOpen/FlagScale.git
- From Source:
# Clone the repository
git clone https://github.com/FlagOpen/FlagScale.git
# Install the requirements
cd FlagScale/install
./install-requirements.sh --env train
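If you chose the Docker route, below is a minimal sketch for starting an interactive container from the pulled image; the --gpus flag assumes the NVIDIA Container Toolkit is installed, and the /workspace mount path is illustrative rather than prescribed by this project:
# Start an interactive container with GPU access, mounting the current directory (paths are illustrative)
docker run --gpus all -it -v $(pwd):/workspace -w /workspace openseek2025/openseek:flagscale-20250527 bash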
Download the OpenSeek-Pretrain-100B dataset into a local directory named OpenSeek-Pretrain-100B inside the OpenSeek directory, for example as sketched below.
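A minimal download sketch using the Hugging Face CLI; the repo id BAAI/OpenSeek-Pretrain-100B is an assumption, so adjust it to the dataset's actual hosting location:
# Download the dataset into ./OpenSeek-Pretrain-100B (repo id is an assumption; adjust as needed)
huggingface-cli download BAAI/OpenSeek-Pretrain-100B --repo-type dataset --local-dir OpenSeek-Pretrain-100B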
Alternatively, after you have created and activated a Python environment, you can run the following script to set up the project environment:
bash openseek/baseline/setup.sh
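The script above assumes an already-activated Python environment; if you do not have one, here is a minimal example using venv (the directory name .venv is illustrative, and any environment manager works):
# Create and activate a virtual environment (illustrative)
python -m venv .venv
source .venv/bin/activate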
Make sure you have completed the environment installation and configuration as outlined in the previous section; your OpenSeek folder should look like this:
OpenSeek
โโโ OpenSeek-Pretrain-100B (Directory containing the downloaded dataset.)
โโโ FlagScale (FlagScale directory cloned from GitHub.)
โโโ OpenSeek-Small-v1-Baseline (Experiment directory; created automatically, contains logs, model checkpoints, etc.)
โโโ ...
Next, you can run the baseline with a simple command:
bash openseek/baseline/run_exp.sh start
After executing bash openseek/baseline/run_exp.sh start, you can follow these steps to confirm your program is running as expected.
Navigate to the OpenSeek root directory. You'll notice a new folder named OpenSeek-Small-v1-Baseline has been created in this directory; this is the log directory. You can view the program's logs and error messages by opening OpenSeek-Small-v1-Baseline/logs/host_0_localhost.output with a text editor such as vim:
vi OpenSeek-Small-v1-Baseline/logs/host_0_localhost.output
If the program is running correctly, after approximately 1-2 minutes, you can execute the following command from the OpenSeek root directory:
grep "iteration.*consumed samples" OpenSeek-Small-v1-Baseline/logs/host_0_localhost.output
If the output resembles the example below, it indicates that your program has successfully started:
[default0]: [2025-05-27 15:23:07] iteration 1/ 24000 | consumed samples: 1024 | elapsed time per iteration (ms): 271607.0 | throughput per GPU (TFLOP/s/GPU): 40.4 | learning rate: 1.500000E-06 | global batch size: 1024 | lm loss: 1.192634E+01 | load_balancing_loss: 1.041994E+00 | loss scale: 1.0 | grad norm: 5.960 | num zeros: 0.0 | params norm: 238.330 | number of skipped iterations: 0 | number of nan iterations: 0 |
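To track training progress over time, you can extract the iteration number and lm loss from the same log file; a small sketch based on the log format above:
# Print "<iteration> <lm loss>" for every logged training step
grep "consumed samples" OpenSeek-Small-v1-Baseline/logs/host_0_localhost.output | sed -E 's/.*iteration[[:space:]]+([0-9]+)\/.*lm loss: ([0-9.E+-]+).*/\1 \2/'
If the wrapper script follows the usual start/stop convention (an assumption; check openseek/baseline/run_exp.sh for the exact subcommands), the run can be stopped with bash openseek/baseline/run_exp.sh stop.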
Target: We construct a large-scale multilingual pretraining dataset exceeding 10 trillion tokens, covering a diverse range of languages and domains. To further improve data quality and training efficiency, we incorporate data synthesis techniques, such as chain-of-thought generation and instruction tuning.
CCI4.0-M2 V1 is a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectories:
| | CCI4.0-M2-Base v1 | CCI4.0-M2-CoT v1 |
|---|---|---|
| Download | huggingface | huggingface |
| Notes | 5.2TB of Chinese webpages and 22TB of English webpages; some data is released in CCI4.0-M2-Extra due to license concerns. | 430 million CoT samples covering math, code, arXiv, wiki, and webpages |
In addition to the main suite, OpenSeek-Pretrain-100B was randomly sampled from the CCI4.0-M2 v1 datasets. This 100B-token subset is used specifically for experimental training.
You can find more details about the data here.
Target: Our study focuses on three key aspects of large-scale language model training: data mixing, hyperparameter tuning, and reinforcement learning (RL). We systematically explore data composition strategies to balance quality and diversity across domains, investigate the impact of hyperparameter configurations on training stability and convergence, and incorporate RL-based optimization to further align model behavior with task-specific objectives.
| | OpenSeek-Small-v1-Baseline | OpenSeek-Small-v1 |
|---|---|---|
| Parameter size | 1.4B (0.4B active) | 1.4B (0.4B active) |
| Number of tokens | 100B | 720B |
| Checkpoint | huggingface | huggingface |
| Wandb | wandb | wandb |
| Evaluation | evaluation | evaluation |
| Experiment config | Experiment Config | Experiment Config |
| Training config | Training Config | Training Config |
| Notes | This model is open-sourced as a baseline for future experiments in areas such as dataset construction, algorithmic strategies, and parallel training frameworks. | OpenSeek-Small v1 is the first-stage production model from the OpenSeek project, designed as a foundation for next-generation language models. |
The usage of, and difference between, the Experiment Config and Training Config are explained here.
Target: With support from the open-source community, FlagScale aims to reproduce DeepSeek V3 & R1's distributed training system, ensuring stable and performant end-to-end training.
- distributed training
- data mixture experiment
- data mixture experiment results
- algorithm experiment
- algorithm experiment results
- system experiment
- Apache 2.0