* add more demos (#12)

modelscope · Aug 10, 2023 · d4ab729 · d4ab729
1 parent 7a06df6
Showing 20 changed files with 1,413 additions and 22 deletions.
diff --git a/README.md b/README.md
@@ -203,7 +203,7 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang
 - Data Processing:
   - Scientific Literature (e.g. [ArXiv](https://info.arxiv.org/help/bulk_data_s3.html)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_sci_data/summary)]
   - Programming Code (e.g. [TheStack](https://huggingface.co/datasets/bigcode/the-stack)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_code_data/summary)]
-  - Chinese Instruction Data (e.g. [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/sft_data_zh/summary)]
+  - Chinese Instruction Data (e.g. [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_sft_zh_data/summary)]
 - Tool Pool:
   - Dataset Splitting by Language [[ModelScope](https://modelscope.cn/studios/Data-Juicer/tool_dataset_splitting_by_language/summary)]
   - Quality Classifier for CommonCrawl [[ModelScope](https://modelscope.cn/studios/Data-Juicer/tool_quality_classifier/summary)]

diff --git a/README_ZH.md b/README_ZH.md
@@ -200,7 +200,7 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang
 * 数据处理:
   * 科学文献 (例如 [ArXiv](https://info.arxiv.org/help/bulk_data_s3.html)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_sci_data/summary)]
   * 编程代码 (例如 [TheStack](https://huggingface.co/datasets/bigcode/the-stack)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_code_data/summary)]
-  * 中文指令数据 (例如 [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/sft_data_zh/summary)]
+  * 中文指令数据 (例如 [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_sft_zh_data/summary)]
 * 工具池:
   * 按语言分割数据集 [[ModelScope](https://modelscope.cn/studios/Data-Juicer/tool_dataset_splitting_by_language/summary)]
   * CommonCrawl 质量分类器 [[ModelScope](https://modelscope.cn/studios/Data-Juicer/tool_quality_classifier/summary)]

diff --git a/demos/README.md b/demos/README.md
@@ -16,6 +16,9 @@ streamlit run app.py
 - Data (`data`)
   - This folder contains some sample datasets.
 
+- Overview scan (`overview_scan`)
+  - This demo introduces the basic concepts and functions of Data-Juicer, such as features, configuration, operators, and so on.
+
 - Data process loop (`data_process_loop`)
   - This demo analyzes and processes a dataset, providing a comparison of statistical information before and after the processing.
 
@@ -28,17 +31,20 @@ streamlit run app.py
 - Data visualization statistics (`data_visualization_statistics`)
   - This demo analyzes the dataset and obtain up to 13 statistics.
 
+- Process SFT Chinese data (`process_sft_zh_data`)
+  - This demo analyzes and processes part of the Chinese data in Alpaca-CoT to show how to process IFT or SFT data for LLM fine-tuning.
+
+- Process SCI data (`process_sci_data`)
+  - This demo analyzes and processes part of the arXiv dataset to show how to process scientific literature data for LLM pre-training.
+
+- Process code data (`process_code_data`)
+  - This demo analyzes and processes part of the Stack-Exchange dataset to show how to process code data for LLM pre-training.
+
 - Text quality classifier (`tool_quality_classifier`)
   - This demo provides 3 text quality classifier to score the dataset.
 
 - Dataset splitting by language (`tool_dataset_splitting_by_language`)
   - This demo splits a dataset to different sub-datasets by language.
 
-## Demos Coming Soon
-- Overview scan
-- Auto evaluation helm
-- Data mixture
-- SFT data zh
-- Process sci data
-- Process code data
-- Data process hpo
+- Data mixture (`data_mixture`)
+  - This demo selects and mixes samples from multiple datasets and exports them into a new dataset.
diff --git a/demos/README_ZH.md b/demos/README_ZH.md
@@ -16,30 +16,35 @@ streamlit run app.py
 - 数据集样例 (`data`)
   - 该文件夹包含一些样例数据集。
 
+- 初探索 (`overview_scan`)
+  - 该示例介绍了 Data-Juicer 的基本概念和功能，例如特性、配置系统，算子等等。
+
 - 数据处理回路 (`data_process_loop`)
   - 该示例用来分析和处理数据集，并给出处理前后数据集的统计信息比对。
 
 - 词法多样性可视化 (`data_visualization_diversity`)
-  - 该示例可以用来分析 SFT 数据集的动词-名词结构, 并绘制成sunburst层级环形图表。
+  - 该示例可以用来分析 SFT 数据集的动词-名词结构，并绘制成sunburst层级环形图表。
 
 - 算子效果可视化 (`data_visualization_op_effect`)
   - 该示例可以分析数据集的统计信息，并根据这些统计信息可以显示出每个 `Filter` 算子在不同阈值下的效果。
 
 - 统计信息可视化 (`data_visualization_statistics`)
-  - 示例可以分析数据集，并获得多达13种统计信息。
+  - 该示例可以分析数据集，并获得多达13种统计信息。
+
+- 处理 SFT 中文数据 (`process_sft_zh_data`)
+  - 以 Alpaca-CoT 的部分中文数据为例，演示了 LLM 中指令跟随微调数据和有监督微调数据的分析和处理流程。
+
+- 处理预训练科学文献类数据 (`process_sci_data`)
+  - 以 arXiv 的部分数据为例，演示了如何处理 LLM 预训练中的科学文献类数据的分析和处理流程。
+
+- 处理预训练代码类数据 (`process_code_data`)
+  - 以 Stack-Exchange 的部分数据为例，演示了如何处理 LLM 预训练中的代码类数据的分析和处理流程。
 
 - 文本质量打分器 (`tool_quality_classifier`)
-  - 该示例提供了3种文本质量打分器， 对数据集进行打分评估。
+  - 该示例提供了3种文本质量打分器，对数据集进行打分评估。
 
 - 按语言分割数据集 (`tool_dataset_splitting_by_language`)
   - 该示例按照语言将数据集拆分为不同的子数据集。
 
-## 即将上线的的演示
-- Overview scan ｜ 初体验
-- Auto evaluation helm ｜ 自动HELM评测
-- Data mixture  ｜ 数据混合
-- SFT data zh   ｜ 中文指令微调数据处理
-- Process sci data ｜ 科学文献数据处理
-- Process code data ｜ 代码数据处理
-- Data process hpo  ｜ 数据混合超参自动优化
-
+- 数据混合 (`data_mixture`)
+  - 该示例从多份数据集中进行采样并混合为一个新的数据集。
diff --git a/demos/data_mixture/app.py b/demos/data_mixture/app.py
@@ -0,0 +1,123 @@
+from pathlib import Path
+
+import pandas as pd
+import streamlit as st
+
+from data_juicer.format import load_formatter
+
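+if hasattr(st, 'data_editor'):  # stable API; avoids comparing versions as strings
+    data_editor = st.data_editor
+else:  # older Streamlit releases only ship the experimental editor
+    data_editor = st.experimental_data_editor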
+
+
+@st.cache_data
+def convert_csv(df):
+    # IMPORTANT: Cache the conversion to prevent computation on every rerun
+    return df.to_csv(encoding='utf_8_sig').encode('utf-8')
+
+
+@st.cache_data
+def convert_jsonl(df):
+    # IMPORTANT: Cache the conversion to prevent computation on every rerun
+    return df.to_json(orient='records', lines=True,
+                      force_ascii=False).encode('utf-8')
+
+
+class Visualize:
+
+    @staticmethod
+    def setup():
+        st.set_page_config(
+            page_title='Data-Juicer',
+            page_icon=':smile:',
+            layout='wide',
+            # initial_sidebar_state="expanded",
+        )
+
+        readme_link = 'https://github.com/alibaba/data-juicer'
+        st.markdown(
+            '<div align = "center"> <font size = "70"> Data-Juicer \
+            </font> </div>',
+            unsafe_allow_html=True,
+        )
+        st.markdown(
+            f'<div align = "center"> A Data-Centric Text Processing System for \
+                Large Language Models, \
+                see more details in our <a href={readme_link}>page</a></div>',
+            unsafe_allow_html=True,
+        )
+
+    @staticmethod
+    def mix_dataset():
+
+        data_files = list(Path('./data').glob('*.jsonl'))
+
+        data_files_dict = {file.stem: str(file) for file in data_files}
+        col1, col2 = st.columns(2)
+        all_selected = []
+        with col1:
+            col3, col4 = st.columns(2)
+            with col3:
+                st.subheader('Select datasets')
+                options = sorted(list(data_files_dict.keys()))
+                selected_ds = st.multiselect(label='datasets',
+                                             options=options,
+                                             label_visibility='hidden')
+                for ds in selected_ds:
+                    all_selected.append({'dataset': ds, 'weight': 1.0})
+            with col4:
+                st.subheader('Select sampling method')
+                options = ['Random']
+                st.selectbox(label='method',
+                             options=options,
+                             label_visibility='hidden')
+
+            st.subheader('Set weight (0.0-1.0)')
+            datasets = data_editor(all_selected, use_container_width=True)
+            ds_names = [ds['dataset'] for ds in datasets]
+            ds_files = [data_files_dict[ds['dataset']] for ds in datasets]
+            weights = [ds['weight'] for ds in datasets]
+        with col2:
+            st.subheader('Show selected dataset details')
+            display_select = st.checkbox('Display')
+            if display_select:
+                if len(datasets) > 0:
+                    tabs = st.tabs(ds_names)
+                    for tab, ds_file in zip(tabs, ds_files):
+                        with tab:
+                            st.write(pd.read_json(ds_file, lines=True))
+
+        start_btn = st.button('Start to mix datasets', use_container_width=True)
+        if start_btn:
+            if len(datasets) > 0:
+                data_path = ' '.join([
+                    ' '.join([str(weight), ds_file])
+                    for ds_file, weight in zip(ds_files, weights)
+                ])
+                formatter = load_formatter(data_path)
+                df = pd.DataFrame(formatter.load_dataset())
+
+                st.session_state.dataset = df
+            else:
+                st.warning('Please select at least one dataset')
+
+        dataset = st.session_state.get('dataset', pd.DataFrame())
+        st.subheader('Mixed dataset')
+        st.dataframe(dataset, use_container_width=True)
+        st.download_button(label='Download mixed dataset as JSONL',
+                           data=convert_jsonl(dataset),
+                           file_name='mixed_dataset.jsonl')
+
+    @staticmethod
+    def visualize():
+        Visualize.setup()
+        Visualize.mix_dataset()
+
+
+def main():
+    Visualize.visualize()
+
+
+if __name__ == '__main__':
+    main()
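
Note on the `data_path` string built in `mix_dataset`: the demo joins each weight and its file path with spaces, then passes the result to `load_formatter`. The sketch below reproduces that string format in isolation and adds a parser for it. `build_weighted_data_path`, `parse_weighted_data_path`, and `sample_by_weight` are hypothetical helpers, not part of Data-Juicer. Treating a weight as a per-dataset sampling fraction is an assumption inferred from the `Set weight (0.0-1.0)` label; this diff does not show how `load_formatter` actually interprets it.

```python
import random
from typing import List, Sequence, Tuple


def build_weighted_data_path(files: Sequence[str],
                             weights: Sequence[float]) -> str:
    """Join (weight, file) pairs as '<w1> <file1> <w2> <file2> ...'.

    Mirrors the string assembled in mix_dataset(); hypothetical helper.
    """
    if len(files) != len(weights):
        raise ValueError('files and weights must have the same length')
    return ' '.join(f'{weight} {path}' for path, weight in zip(files, weights))


def parse_weighted_data_path(data_path: str) -> List[Tuple[float, str]]:
    """Split the string back into (weight, file) pairs.

    Assumes paths contain no whitespace, which the space-joined format
    cannot represent unambiguously.
    """
    tokens = data_path.split()
    if len(tokens) % 2 != 0:
        raise ValueError('expected alternating weight and path tokens')
    return [(float(tokens[i]), tokens[i + 1])
            for i in range(0, len(tokens), 2)]


def sample_by_weight(records: Sequence, weight: float, seed: int = 0) -> list:
    """Keep a random fraction of records (assumed meaning of a weight)."""
    weight = min(max(weight, 0.0), 1.0)  # clamp to the 0.0-1.0 range
    keep = int(len(records) * weight)
    return random.Random(seed).sample(list(records), keep)


if __name__ == '__main__':
    path = build_weighted_data_path(['a.jsonl', 'b.jsonl'], [1.0, 0.5])
    print(path)
    print(parse_weighted_data_path(path))
    print(len(sample_by_weight(range(10), 0.5)))
```

Because the format is whitespace-delimited, a dataset path containing spaces would be split into extra tokens; validating file names before joining them is one way to guard against that.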