This repository has been archived by the owner on Jan 26, 2021. It is now read-only.

how to install it on multi nodes for distributed training? #60

Open
adrianhust opened this issue Nov 11, 2017 · 10 comments

Comments

@adrianhust

adrianhust commented Nov 11, 2017

I have tried several times to install it on multiple nodes, but it failed every time; it only succeeded on a single machine.
Can anyone give a detailed guide for this?
Neither mpich nor zeromq works for me.

Any hints would be appreciated. Thank you!

@adrianhust changed the title from "how to install it on multi node for distributed training?" to "how to install it on multi nodes for distributed training?" on Nov 11, 2017
@1234clam

I think you should just install it on all the nodes and run it with a machine list file.

@adrianhust

Thank you for your reply, but that does not work for me. lightlda depends on mpich or zeromq; I installed it on two nodes, but they cannot communicate using the machine list file.

@1234clam

Can you log in over SSH without a password?

@xiaomiao91

Are there any examples of how to configure distributed training? I searched and could not find any.

@1234clam

@xiaomiao91 I haven't found any examples of how to configure distributed training either. I just trained on the nytimes data set from the example provided by the project.

@chivee

chivee commented Nov 15, 2017

@1234clam, @adrianhust, does mpirun -n 2 work on two nodes? If so, please make sure mpirun is on the PATH in the SSH environment.

https://www.open-mpi.org/faq/?category=running may help.
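
The two checks suggested so far (passwordless SSH, and mpirun being visible in the SSH environment) can be scripted. The following is a minimal sketch, assuming the OpenSSH `ssh` client is available on the launching machine. The function names `ssh_check_command` and `check_hosts` are made up for illustration, and the host names passed in would be whatever appears in your machine list file.

```python
import subprocess


def ssh_check_command(host, program="mpirun"):
    """Build an ssh command that fails instead of prompting for a password,
    and asks the remote non-interactive shell where `program` is."""
    return [
        "ssh",
        "-o", "BatchMode=yes",     # never prompt for a password; fail instead
        "-o", "ConnectTimeout=5",  # do not hang on unreachable hosts
        host,
        f"command -v {program}",   # prints the path; non-zero exit if missing
    ]


def check_hosts(hosts, program="mpirun", run=subprocess.run):
    """Return {host: remote path of program, or None on any failure}.

    `run` is injectable (an assumption of this sketch) so the logic can be
    exercised without a real network.
    """
    results = {}
    for host in hosts:
        proc = run(ssh_check_command(host, program),
                   capture_output=True, text=True)
        results[host] = proc.stdout.strip() if proc.returncode == 0 else None
    return results
```

A `None` for a host means that either the passwordless login failed or the program is not on the PATH of the shell that SSH starts there; running the same `ssh` command by hand shows which.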

@xiaomiao91

Hi,
I executed this command on server 10.210.228.70:
mpiexec -machinefile machine_list -num_vocabs 111400 -num_topics 1000 -num_iterations 100 -alpha 0.1 -beta 0.01 -mh_steps 2 -num_local_workers 1 -num_blocks 1 -max_num_document 300000 -input_dir /data3/ad_dm/zemin/LightLDA/example/data/nytimes -data_capacity 800
It ran the nytimes example on two nodes successfully. My machinefile looks like this:

10.210.228.70
10.210.228.64

Now I want to run my real data on more nodes. Should I split my large data set across every node of the cluster and then just execute the above command, or is there anything else I should do or take care of?
Thanks

@1234clam

1234clam commented Nov 16, 2017

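@xiaomiao91 Yes, you just need to split the data and copy the pieces to the other machines. I do find this design a bit cumbersome, though: convert the data to libsvm first, then split the libsvm-format file. (Chinese is still easier for this.)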
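
The splitting step described above can be sketched as follows. This is a minimal illustration, not something taken from the project: it assumes the libsvm file holds one document per line, the function name `split_libsvm` and the `<stem>.part<k>.libsvm` naming scheme are hypothetical, and each shard would still need the same preprocessing as the single-node example before training (not shown here).

```python
from pathlib import Path


def split_libsvm(src, out_dir, num_shards):
    """Split a libsvm-format file (one document per line) into num_shards
    files, assigning documents round-robin so shard sizes differ by at most 1."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: <stem>.part<k>.libsvm
    stem = Path(src).stem
    paths = [out_dir / f"{stem}.part{k}.libsvm" for k in range(num_shards)]
    outs = [p.open("w", encoding="utf-8") for p in paths]
    try:
        doc_index = 0
        with open(src, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines without counting them
                # Ensure every written document ends with a newline.
                outs[doc_index % num_shards].write(line.rstrip("\n") + "\n")
                doc_index += 1
    finally:
        for out in outs:
            out.close()
    return paths
```

Because the split happens after the conversion to libsvm, every shard keeps the same word ids. One shard would then be copied to each host in the machine list.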

@xiaomiao91

Thanks, I'll give it a try 😊

@Abigale001

When I run mpiexec, the output is:

The program 'mpiexec' can be found in the following packages:

  • lam-runtime
  • mpich
  • openmpi-bin

Try: sudo apt install <selected package>

This is odd, because I ran make install and the other steps exactly as build.sh shows.

Could anyone help?
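
One possibility, which is an assumption and not confirmed anywhere in this thread, is that the build placed an mpiexec binary in a directory that is not on PATH. A minimal sketch for checking that is below; the function name `find_program` is made up for illustration, and the directories to search (for example, the checkout where build.sh was run) are yours to supply.

```python
import os
import shutil


def find_program(name, search_roots=()):
    """Return paths where an executable called `name` exists.

    Checks PATH first via shutil.which, then walks each directory
    in search_roots looking for an executable file with that name.
    """
    found = []
    on_path = shutil.which(name)
    if on_path:
        found.append(on_path)
    for root in search_roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            if name in filenames:
                candidate = os.path.join(dirpath, name)
                if os.access(candidate, os.X_OK) and candidate not in found:
                    found.append(candidate)
    return found
```

If this finds a copy under the build tree but nothing on PATH, adding that copy's directory to PATH (on every node) would be the next thing to try.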
