Commit

* add more demos (#12)
zhijianma authored Aug 10, 2023
1 parent 7a06df6 commit d4ab729
Showing 20 changed files with 1,413 additions and 22 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -203,7 +203,7 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang
- Data Processing:
- Scientific Literature (e.g. [ArXiv](https://info.arxiv.org/help/bulk_data_s3.html)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_sci_data/summary)]
- Programming Code (e.g. [TheStack](https://huggingface.co/datasets/bigcode/the-stack)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_code_data/summary)]
- Chinese Instruction Data (e.g. [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/sft_data_zh/summary)]
- Chinese Instruction Data (e.g. [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_sft_zh_data/summary)]
- Tool Pool:
- Dataset Splitting by Language [[ModelScope](https://modelscope.cn/studios/Data-Juicer/tool_dataset_splitting_by_language/summary)]
- Quality Classifier for CommonCrawl [[ModelScope](https://modelscope.cn/studios/Data-Juicer/tool_quality_classifier/summary)]
2 changes: 1 addition & 1 deletion README_ZH.md
@@ -200,7 +200,7 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang
* Data Processing:
* Scientific Literature (e.g. [ArXiv](https://info.arxiv.org/help/bulk_data_s3.html)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_sci_data/summary)]
* Programming Code (e.g. [TheStack](https://huggingface.co/datasets/bigcode/the-stack)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_code_data/summary)]
* Chinese Instruction Data (e.g. [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/sft_data_zh/summary)]
* Chinese Instruction Data (e.g. [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_sft_zh_data/summary)]
* Tool Pool:
* Dataset Splitting by Language [[ModelScope](https://modelscope.cn/studios/Data-Juicer/tool_dataset_splitting_by_language/summary)]
* Quality Classifier for CommonCrawl [[ModelScope](https://modelscope.cn/studios/Data-Juicer/tool_quality_classifier/summary)]
22 changes: 14 additions & 8 deletions demos/README.md
@@ -16,6 +16,9 @@ streamlit run app.py
- Data (`data`)
- This folder contains some sample datasets.

- Overview scan (`overview_scan`)
- This demo introduces the basic concepts and functions of Data-Juicer, such as features, configuration, operators, and so on.

- Data process loop (`data_process_loop`)
- This demo analyzes and processes a dataset, providing a comparison of statistical information before and after the processing.

@@ -28,17 +31,20 @@ streamlit run app.py
- Data visualization statistics (`data_visualization_statistics`)
  - This demo analyzes the dataset and obtains up to 13 statistics.

- Process SFT Chinese data (`process_sft_zh_data`)
  - This demo analyzes and processes part of the Chinese data in Alpaca-CoT to show how to process IFT or SFT data for LLM fine-tuning.

- Process SCI data (`process_sci_data`)
  - This demo analyzes and processes part of the arXiv dataset to show how to process scientific literature data for LLM pre-training.

- Process code data (`process_code_data`)
  - This demo analyzes and processes part of the Stack-Exchange dataset to show how to process code data for LLM pre-training.

- Text quality classifier (`tool_quality_classifier`)
  - This demo provides 3 text quality classifiers to score the dataset.

- Dataset splitting by language (`tool_dataset_splitting_by_language`)
  - This demo splits a dataset into different sub-datasets by language.

## Demos Coming Soon
- Overview scan
- Auto evaluation helm
- Data mixture
- SFT data zh
- Process sci data
- Process code data
- Data process hpo
- Data mixture (`data_mixture`)
- This demo selects and mixes samples from multiple datasets and exports them into a new dataset.
29 changes: 17 additions & 12 deletions demos/README_ZH.md
@@ -16,30 +16,35 @@ streamlit run app.py
- Sample datasets (`data`)
  - This folder contains some sample datasets.

- Overview scan (`overview_scan`)
  - This demo introduces the basic concepts and functions of Data-Juicer, such as its features, configuration system, operators, and so on.

- Data process loop (`data_process_loop`)
  - This demo analyzes and processes a dataset, providing a comparison of statistical information before and after the processing.

- Lexical diversity visualization (`data_visualization_diversity`)
  - This demo analyzes the verb-noun structure of an SFT dataset and plots it as a sunburst hierarchical chart.

- Operator effect visualization (`data_visualization_op_effect`)
  - This demo analyzes the statistics of a dataset and, based on them, shows the effect of each `Filter` operator under different thresholds.

- Statistics visualization (`data_visualization_statistics`)
  - This demo analyzes a dataset and obtains up to 13 statistics.

- Process SFT Chinese data (`process_sft_zh_data`)
  - Using part of the Chinese data in Alpaca-CoT as an example, this demo shows how to analyze and process instruction-following and supervised fine-tuning data for LLMs.

- Process scientific literature data for pre-training (`process_sci_data`)
  - Using part of the arXiv data as an example, this demo shows how to analyze and process scientific literature data for LLM pre-training.

- Process code data for pre-training (`process_code_data`)
  - Using part of the Stack-Exchange data as an example, this demo shows how to analyze and process code data for LLM pre-training.

- Text quality classifier (`tool_quality_classifier`)
  - This demo provides 3 text quality classifiers to score datasets.

- Dataset splitting by language (`tool_dataset_splitting_by_language`)
  - This demo splits a dataset into different sub-datasets by language.

## Demos coming soon
- Overview scan
- Auto evaluation helm (automatic HELM evaluation)
- Data mixture
- SFT data zh (Chinese instruction fine-tuning data processing)
- Process sci data (scientific literature data processing)
- Process code data (code data processing)
- Data process hpo (automatic hyperparameter optimization for data mixing)

- Data mixture (`data_mixture`)
  - This demo samples from multiple datasets and mixes them into a new dataset.
123 changes: 123 additions & 0 deletions demos/data_mixture/app.py
@@ -0,0 +1,123 @@
from pathlib import Path

import pandas as pd
import streamlit as st

from data_juicer.format import load_formatter

# streamlit renamed `experimental_data_editor` to `data_editor` in 1.23.0
if st.__version__ >= '1.23.0':
    data_editor = st.data_editor
else:
    data_editor = st.experimental_data_editor


@st.cache_data
def convert_csv(df):
    # IMPORTANT: Cache the conversion to prevent computation on every rerun
    # Encode with a BOM ('utf-8-sig') so spreadsheet tools detect UTF-8
    return df.to_csv().encode('utf-8-sig')


@st.cache_data
def convert_jsonl(df):
    # IMPORTANT: Cache the conversion to prevent computation on every rerun
    return df.to_json(orient='records', lines=True,
                      force_ascii=False).encode('utf-8')


class Visualize:

    @staticmethod
    def setup():
        st.set_page_config(
            page_title='Data-Juicer',
            page_icon=':smile:',
            layout='wide',
            # initial_sidebar_state="expanded",
        )

        readme_link = 'https://github.com/alibaba/data-juicer'
        st.markdown(
            '<div align = "center"> <font size = "70"> Data-Juicer \
            </font> </div>',
            unsafe_allow_html=True,
        )
        st.markdown(
            f'<div align = "center"> A Data-Centric Text Processing System for \
            Large Language Models, \
            see more details in our <a href={readme_link}>page</a></div>',
            unsafe_allow_html=True,
        )

    @staticmethod
    def mix_dataset():

        data_files = list(Path('./data').glob('*.jsonl'))

        data_files_dict = {file.stem: str(file) for file in data_files}
        col1, col2 = st.columns(2)
        all_selected = []
        with col1:
            col3, col4 = st.columns(2)
            with col3:
                st.subheader('Select datasets')
                options = sorted(list(data_files_dict.keys()))
                selected_ds = st.multiselect(label='datasets',
                                             options=options,
                                             label_visibility='hidden')
                for ds in selected_ds:
                    all_selected.append({'dataset': ds, 'weight': 1.0})
            with col4:
                st.subheader('Select sampling method')
                options = ['Random']
                st.selectbox(label='method',
                             options=options,
                             label_visibility='hidden')

            st.subheader('Set weight (0.0-1.0)')
            datasets = data_editor(all_selected, use_container_width=True)
            ds_names = [ds['dataset'] for ds in datasets]
            ds_files = [data_files_dict[ds['dataset']] for ds in datasets]
            weights = [ds['weight'] for ds in datasets]
        with col2:
            st.subheader('Show selected dataset details')
            display_select = st.checkbox('Display')
            if display_select:
                if len(datasets) > 0:
                    tabs = st.tabs(ds_names)
                    for tab, ds_file in zip(tabs, ds_files):
                        with tab:
                            st.write(pd.read_json(ds_file, lines=True))

        start_btn = st.button('Start to mix datasets', use_container_width=True)
        if start_btn:
            if len(datasets) > 0:
                # build "<weight1> <file1> <weight2> <file2> ..." for the formatter
                data_path = ' '.join([
                    ' '.join([str(weight), ds_file])
                    for ds_file, weight in zip(ds_files, weights)
                ])
                formatter = load_formatter(data_path)
                df = pd.DataFrame(formatter.load_dataset())

                st.session_state.dataset = df
            else:
                st.warning('Please select at least one dataset')

        dataset = st.session_state.get('dataset', pd.DataFrame())
        st.subheader('Mixed dataset')
        st.dataframe(dataset, use_container_width=True)
        st.download_button(label='Download mixed dataset as JSONL',
                           data=convert_jsonl(dataset),
                           file_name='mixed_dataset.jsonl')

    @staticmethod
    def visualize():
        Visualize.setup()
        Visualize.mix_dataset()


def main():
    Visualize.visualize()


if __name__ == '__main__':
    main()
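The mixing step in `app.py` reduces to building a single space-separated "weight path" string that is handed to `load_formatter`. A minimal sketch of just that string construction, in plain Python with no Data-Juicer dependency (`build_data_path` is a hypothetical helper name, not part of the library):

```python
def build_data_path(ds_files, weights):
    # Mirror the app's logic: pair each weight with its file path and
    # join everything into one space-separated string, e.g.
    # "1.0 a.jsonl 0.5 b.jsonl"
    return ' '.join(
        ' '.join([str(weight), ds_file])
        for ds_file, weight in zip(ds_files, weights)
    )


print(build_data_path(['a.jsonl', 'b.jsonl'], [1.0, 0.5]))
# → 1.0 a.jsonl 0.5 b.jsonl
```

Whether `load_formatter` accepts exactly this format is taken from the app's own usage above; a weight of 1.0 keeps all samples from a file, while smaller weights down-sample it during mixing.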
