OpenSeek Logo
Homepage Hugging Face Discord Wechat
License GitHub stars GitHub forks GitHub issues

OpenSeek is dedicated to uniting the global open-source community to drive collaborative innovation in algorithms, data, and systems, with the goal of developing next-generation models that surpass DeepSeek.

English | 简体中文

📌 Project Overview

OpenSeek is an open-source project initiated by the Beijing Academy of Artificial Intelligence (BAAI). It aims to unite the global open-source community to drive collaborative innovation in algorithms, data, and systems, and to develop next-generation models that surpass DeepSeek. Drawing inspiration from large-model initiatives such as BigScience and OPT, the project is dedicated to building an independent open-source algorithmic innovation system. Since the DeepSeek model was open-sourced, academia has produced numerous algorithmic improvements and breakthroughs, but these innovations often lack complete code implementations, the necessary computational resources, and high-quality data support. OpenSeek aims to explore high-quality dataset construction mechanisms together with the open-source community, open-source the entire large-model training pipeline, build innovative training and inference code that supports AI chips beyond Nvidia, and promote independent technological innovation and application development.

Objectives of OpenSeek:

  • Advanced data technology: address the challenge of acquiring high-quality data.
  • Support for multiple AI devices: reduce dependency on specific chips and improve model universality and adaptability.
  • Standardized LLM training baseline: promote independent algorithmic innovation and technology sharing through open-source collaboration.

Project: https://github.com/orgs/FlagAI-Open/projects/1

Acknowledgments & Contribution Guidelines

Thanks to the FlagScale team for their support of OpenSeek training.

  • For system-related improvements: please report framework-specific issues to FlagScale's GitHub Issues. Code contributions should be submitted via Pull Requests (PRs) to the FlagScale repository.

  • For data & algorithm improvements: discuss dataset implementations, training optimizations, and experimental configurations here.

For detailed information on how to contribute, please refer to our Contribution Guide. Feel free to contact us via the [Discord channel].

[WeChat QR code]

📢 News

  • 🔥 [05/06/2025] Data group: released the bilingual pretraining dataset CCI4.0-M2-V1 [readme]. Algorithm group: released the pretrained model OpenSeek-Small V1 [readme][download].
  • 🔥 [03/20/2025] #4 online meetup, 19:00-20:00: [screen recording]
  • 🔥 [03/20/2025] #3 online meetup, 19:00-20:00: [screen recording]
  • 🔥 [03/06/2025] #2 online meetup, 19:00-20:00: [screen recording]
  • 🔥 [02/25/2025] #1 online meetup, 18:00-19:00: [screen recording]
  • 🔥 [02/13/2025] Completed experiments on the OpenSeek-PT-1T dataset, more.

🚗 Getting Started

What is the Baseline

The openseek-baseline serves as the baseline for the PAZHOU algorithm competition and is also used to evaluate PRs to OpenSeek. It is a standardized LLM training and evaluation pipeline consisting of a 100B-token dataset, training code, a wandb run, a checkpoint, and evaluation results.

Preparing the Environment

  1. Clone this repository and enter the directory:
git clone https://github.com/FlagAI-Open/OpenSeek.git
cd OpenSeek
  2. Install the FlagScale dependencies:
  • Using Docker (recommended; an example container start follows this list)
# Pull the image
docker pull openseek2025/openseek:flagscale-20250527

# Clone the repository
git clone https://github.com/FlagOpen/FlagScale.git
  • From Source:
# Clone the repository
git clone https://github.com/FlagOpen/FlagScale.git

# Install the requirements
cd FlagScale/install
./install-requirements.sh --env train
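
If you use the Docker image, the cloned repositories still need to be available inside the container. Below is a minimal sketch of starting a container from the pulled image; the mount path, working directory, container name, and GPU flag are illustrative assumptions rather than part of the official workflow:

# Example only: start an interactive container with the current directory mounted
# (adjust the mount path, container name, and GPU options to your environment)
docker run --gpus all -it \
  --name openseek-dev \
  -v "$(pwd)":/workspace/OpenSeek \
  -w /workspace/OpenSeek \
  openseek2025/openseek:flagscale-20250527 \
  /bin/bash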

Preparing the Data

Download the OpenSeek-Pretrain-100B dataset into a local directory named OpenSeek-Pretrain-100B inside the OpenSeek repository root.
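
One way to fetch the dataset is with the Hugging Face CLI. The repository id below is an assumption used for illustration; check the dataset's Hugging Face page for the exact id before running:

# Assumed dataset repo id -- verify it on Hugging Face first
huggingface-cli download BAAI/OpenSeek-Pretrain-100B \
  --repo-type dataset \
  --local-dir OpenSeek-Pretrain-100B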

Alternatively, after you have created and activated a Python environment, you can run the following script to set up the project environment:

bash openseek/baseline/setup.sh

Running the Baseline

Make sure you have completed the environment installation and configuration outlined in the previous section; your OpenSeek folder should look like this:

OpenSeek
├── OpenSeek-Pretrain-100B (dataset directory for the downloaded data)
├── FlagScale (FlagScale repository cloned from GitHub)
├── OpenSeek-Small-v1-Baseline (experiment directory, created automatically; contains logs, model checkpoints, etc.)
├── ...
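
Before launching, you can quickly check that the two required directories are in place (plain shell, nothing OpenSeek-specific):

# Both paths should be listed without a "No such file or directory" error
ls -d OpenSeek-Pretrain-100B FlagScale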

Next, you can run the baseline with a simple command:

bash openseek/baseline/run_exp.sh start

How to Verify Your Program is Running Correctly

After executing bash openseek/baseline/run_exp.sh start, you can follow these steps to confirm your program is running as expected.

  1. Navigate to the OpenSeek root directory. You'll notice that a new folder named OpenSeek-Small-v1-Baseline has been created there; this is the log directory.

  2. You can view the program's logs and error messages by opening OpenSeek-Small-v1-Baseline/logs/host_0_localhost.output with a text editor like vim:

    vi OpenSeek-Small-v1-Baseline/logs/host_0_localhost.output
    
  3. If the program is running correctly, after approximately 1-2 minutes, you can execute the following command from the OpenSeek root directory:

    grep "iteration.*consumed samples" OpenSeek-Small-v1-Baseline/logs/host_0_localhost.output
    

    If the output resembles the example below, it indicates that your program has successfully started:

    [default0]: [2025-05-27 15:23:07] iteration        1/    24000 | consumed samples:          1024 | elapsed time per iteration (ms): 271607.0 | throughput per GPU (TFLOP/s/GPU): 40.4 | learning rate: 1.500000E-06 | global batch size:  1024 | lm loss: 1.192634E+01 | load_balancing_loss: 1.041994E+00 | loss scale: 1.0 | grad norm: 5.960 | num zeros: 0.0 | params norm: 238.330 | number of skipped iterations:    0 | number of nan iterations:    0 |
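
To keep watching progress after this initial check, the same log file can be followed with standard shell tools:

# Follow the training log live (Ctrl+C stops tail, not the training job)
tail -f OpenSeek-Small-v1-Baseline/logs/host_0_localhost.output

# Count how many iterations have been logged so far
grep -c "iteration.*consumed samples" OpenSeek-Small-v1-Baseline/logs/host_0_localhost.output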
    

📚 Data Group

Target: We are constructing a large-scale multilingual pretraining dataset exceeding 10 trillion tokens that covers a diverse range of languages and domains. To further improve data quality and training efficiency, we incorporate data synthesis techniques such as chain-of-thought generation and instruction tuning.

Stage 1 results

CCI4.0-M2 V1 is a large-scale bilingual pre-training dataset engineered for superior data quality and diverse, human-like reasoning trajectories:

  • CCI4.0-M2-Base v1
    Download: huggingface
    Notes: 5.2 TB of Chinese webpages and 22 TB of English webpages; some data is released in CCI4.0-M2-Extra due to license concerns.
  • CCI4.0-M2-CoT v1
    Download: huggingface
    Notes: 430 million CoT samples covering math, code, arXiv, wiki, and webpages.

In addition to the main suite, OpenSeek-Pretrain-100B was randomly sampled from the CCI4.0-M2 v1 datasets. This 100B-token subset is used specifically for experimental training.
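
If you have already downloaded the OpenSeek-Pretrain-100B subset as described in Getting Started, a quick local sanity check might look like this (plain shell, using the directory name from that section):

# Rough on-disk size of the subset
du -sh OpenSeek-Pretrain-100B

# List the first few files
find OpenSeek-Pretrain-100B -type f | head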

You can find more details about the data here.

🚀 Algorithm Group

Target: Our study focuses on three key aspects of large-scale language model training: data mixing, hyperparameter tuning, and reinforcement learning (RL). We systematically explore data composition strategies to balance quality and diversity across domains, investigate the impact of hyperparameter configurations on training stability and convergence, and incorporate RL-based optimization to further align model behavior with task-specific objectives.

Stage 1 results

  • OpenSeek-Small-v1-Baseline
    Parameter size: 1.4B (0.4B active)
    Number of tokens: 100B
    Checkpoint: huggingface | Wandb: wandb | Evaluation: evaluation
    Experiment config: Experiment Config | Training config: Training Config
    Notes: This model is open-sourced as a baseline for future experiments in areas such as dataset construction, algorithmic strategies, and parallel training frameworks.
  • OpenSeek-Small-v1
    Parameter size: 1.4B (0.4B active)
    Number of tokens: 720B
    Checkpoint: huggingface | Wandb: wandb | Evaluation: evaluation
    Experiment config: Experiment Config | Training config: Training Config
    Notes: OpenSeek-Small v1 is the first-stage production model from the OpenSeek project, designed as a foundation for next-generation language models.

The usage of and differences between the Experiment Config and the Training Config are explained here.

๐Ÿ–ฅ๏ธ System Group

Target: With support from the open-source community, FlagScale aims to reproduce DeepSeek V3 & R1's distributed training system, ensuring stable and performant end-to-end training.

Stage 1 results


Experiments & Advanced Usage

📜 License Agreement

  • Apache 2.0
