Yang Wenlong's Blog » Blog Archive » [Repost] Parallelizing Model Training in Moses

[Repost] Parallelizing Model Training in Moses

Published: 2015-11-18 | Categories: Machine Translation, Natural Language Processing

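Using Moses is fairly simple, and the official Moses site covers the details. This blog reposts only this one article about usage; the other posts focus mainly on algorithms and theory: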

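As is well known, every model in Moses except the language model, which is trained separately with SRILM's ngram-count tool, is trained in one pass by the training script train-factored-phrase-model.perl. This covers the translation model, the reordering model, and the generation model. Once the relevant options are set, training needs no manual intervention, and when it finishes, all the files the decoder needs have been generated, which is very convenient.
The training script consists of the following nine steps. Some run only if the user's settings call for them and are skipped otherwise, but the basic translation model is always trained:
(1) Prepare the corpus (prepare corpus)
(2) Run GIZA++ (run GIZA)
(3) Align words (align words)
(4) Score the lexicon, i.e. compute word translation probabilities by maximum likelihood estimation (learn lexical translation)
(5) Extract phrases (extract phrases)
(6) Score phrases, i.e. generate the phrase table, which is the translation model (score phrases)
(7) Train the reordering model (learn reordering model)
(8) Train the generation model (learn generation model)
(9) Create the configuration file the decoder needs (create decoder config file)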
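Step (4) is the easiest of the nine to show in a few lines. The following is a minimal sketch of maximum likelihood estimation over aligned word pairs, assuming the alignment links are already available as (source word, target word) tuples. The function name `lexical_mle` and the toy data are hypothetical and are not taken from the Moses scripts.

```python
from collections import Counter, defaultdict

def lexical_mle(aligned_pairs):
    """Estimate p(target | source) by relative frequency over aligned word pairs."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(src for src, _ in aligned_pairs)
    table = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        # MLE: count of the pair divided by the count of the source word.
        table[src][tgt] = count / source_counts[src]
    return dict(table)

# Toy alignment links (hypothetical data).
links = [("haus", "house"), ("haus", "house"), ("haus", "home"), ("das", "the")]
probs = lexical_mle(links)
print(probs["haus"])
```

The same counting run with source and target swapped would give the probabilities in the other direction.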
Of these nine steps, running GIZA++ is by far the most time-consuming. The Moses reference documentation says:
GIZA++ is a freely available implementation of the IBM Models. We need it as a initial step to establish word alignments. Our word alignments are taken from the intersection of bidirectional runs of GIZA++ plus some additional alignment points from the union of the two runs.
Running GIZA++ is the most time consuming step in the training process. It also requires a lot of memory (1-2 GB RAM is common for large parallel corpora).
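The combination of the two directional runs described in the quote can be sketched as follows. This is a simplified illustration, assuming each GIZA++ run yields a set of (source index, target index) links. The function `symmetrize` and its neighbour rule are hypothetical and only mimic the idea of "intersection plus some points from the union"; they are not the heuristic Moses actually implements.

```python
def symmetrize(src2tgt, tgt2src):
    """Start from the intersection, then add union points adjacent to an accepted link."""
    intersection = src2tgt & tgt2src
    union = src2tgt | tgt2src
    alignment = set(intersection)
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    added = True
    while added:
        added = False
        # Visit candidate points in a fixed order so the result is deterministic.
        for i, j in sorted(union - alignment):
            if any((i + di, j + dj) in alignment for di, dj in neighbours):
                alignment.add((i, j))
                added = True
    return alignment

# Toy alignments from the two directions (hypothetical data).
forward = {(0, 0), (1, 1), (2, 2)}
backward = {(0, 0), (1, 1), (1, 2), (3, 4)}
combined = symmetrize(forward, backward)
print(sorted(combined))
```

In this toy case the isolated point (3, 4) is left out because it touches no accepted link, while (1, 2) and (2, 2) are added.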
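GIZA++ learns iteratively in both directions of the bilingual sentence pairs, and these two directional training runs take up the vast majority of model training time. Moses therefore offers a first, simple remedy: training in parallel.
Add the --parallel option when training. The training script is then forked, and the two directions of GIZA++ training run as independent processes. This is the best choice on a multi-processor machine.
Using the --parallel option will fork the script and run the two directions of GIZA as independent processes. This is the best choice on a multi-processor machine.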
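The fork-and-wait pattern behind this option can be sketched as follows. This is only an illustration of running two independent processes and joining them before the next step. The placeholder commands stand in for the two GIZA++ runs and are hypothetical, as is the helper `run_both_directions`; the training script's own implementation is not shown in the source.

```python
import subprocess
import sys

def run_both_directions(commands):
    """Start every command as its own process, then wait for all of them."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in processes]

# Placeholder commands standing in for the two GIZA++ runs (hypothetical).
placeholder = [sys.executable, "-c", "import time; time.sleep(0.1)"]
exit_codes = run_both_directions([placeholder, placeholder])

# Step 3 (word alignment) may only start once both directions have finished.
ready_for_step_3 = all(code == 0 for code in exit_codes)
print(ready_for_step_3)
```

Waiting on both processes before continuing is what keeps the alignment step from starting on incomplete output.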
If you want to run the two GIZA++ directions in parallel on a single-processor machine, you can use the procedure below. (In my view this is unnecessary on a single processor: the result should be the same as a normal run, and the time saved is negligible.)
First you start training the usual way with the additional switches --last-step 2 --direction 1, which runs the data preparation and one direction of GIZA++ training. When the GIZA++ step started, start a second training run with the switches --first-step 2 --direction 2. This runs the second GIZA++ run in parallel, and then continues the rest of the model training. (Beware of race conditions! The second GIZA might finish earlier than the first one, so training step 3 might start too early!)
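The two invocations in this procedure differ only in their trailing switches, which the following sketch makes explicit. The helper `manual_parallel_commands` is hypothetical, and `base_args` is deliberately left empty because the remaining training options are not given here.

```python
def manual_parallel_commands(script, base_args):
    """Return the two command lines for the manual two-run procedure."""
    # First run: data preparation plus GIZA++ direction 1, stopping after step 2.
    first_run = [script, *base_args, "--last-step", "2", "--direction", "1"]
    # Second run: starts at step 2 with direction 2, then continues the training.
    second_run = [script, *base_args, "--first-step", "2", "--direction", "2"]
    return first_run, second_run

# base_args is empty here; a real run also needs the usual corpus and model options.
first, second = manual_parallel_commands("train-factored-phrase-model.perl", [])
print(" ".join(first))
print(" ".join(second))
```

Only the second run carries on past step 2, which is why it must not reach step 3 before the first run has finished.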
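The method built into Moses makes good use of a dual-core processor, but not of machines with more cores. On my 4-core machine, for example, I observed that at any point during training only two cores were at 100% utilization while the other two sat mostly idle. Leaving those cores unused is wasteful, but this problem already has a solution: MGIZA++.
MGIZA++ is a multi-threaded extension of GIZA++, described as follows: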
Multi Thread GIZA++ is an extension to GIZA++ word aligning tool. It can perform much faster training than origin GIZA++ if you have more than one CPUs, in addition it fixed some bugs in GIZA, and the final aligning perplexity is generally lower than original GIZA++.
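With MGIZA++ you can specify how many processors to use on your machine, which is very convenient. Note that the author of MGIZA++, who is Chinese, also provides another tool for parallelizing GIZA++: PGIZA++. PGIZA++ runs GIZA++ on distributed machines and uses a MapReduce-style framework:
PGIZA++ is another version that can run on clusters. PGIZA++ based on schedulers of the cluster, here we have script for maui, Condor and simply ssh remote procedure call.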
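The general idea behind spreading alignment training over several workers can be sketched as splitting the corpus, collecting counts per chunk, and merging the partial counts. This is a minimal illustration under that assumption. The functions `count_cooccurrences` and `parallel_counts` are hypothetical and do not reproduce the internals of MGIZA++ or PGIZA++.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_cooccurrences(sentence_pairs):
    """Count (source word, target word) co-occurrences in one chunk of the corpus."""
    counts = Counter()
    for source, target in sentence_pairs:
        for s in source.split():
            for t in target.split():
                counts[(s, t)] += 1
    return counts

def parallel_counts(corpus, workers=2):
    """Split the corpus into chunks, count each chunk in a worker, then merge."""
    chunks = [corpus[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(count_cooccurrences, chunks))
    total = Counter()
    for counts in partial:
        total.update(counts)  # Counter.update adds counts rather than replacing them
    return total

# Toy parallel corpus (hypothetical data).
corpus = [("das haus", "the house"), ("das buch", "the book"), ("ein haus", "a house")]
merged = parallel_counts(corpus, workers=2)
print(merged[("das", "the")], merged[("haus", "house")])
```

The sketch shows the structure only: CPython threads do not speed up CPU-bound counting, whereas a native multi-threaded or cluster implementation can.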
Home page for both MGIZA++ and PGIZA++:
http://www.cs.cmu.edu/~qing/
Three references, all easy to find on Google:
1. Parallelizing the Training Procedure of Statistical Phrase-based Machine Translation
2. Parallel Implementations of Word Alignment Tool
3. Fast, Easy, and Cheap: Construction of Statistical Machine Translation Models with MapReduce
