Is there a tutorial about giza++? [closed]

2023-02-28 14:18

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.

Closed 8 years ago.


The URLs in its README file are no longer valid (http://www.fjoch.com/mkcls.html and http://www.fjoch.com/GIZA++.html). Is there a good tutorial about GIZA++? Or are there alternatives that have complete documentation?

The following is excerpted from a tutorial I'm putting together for a class. (NB: This assumes you have successfully installed GIZA++-v2 on a *nix system.)

Start with two data files containing parallel sentences that have been tokenized, one sentence per line. For example, a pair of parallel English-French files might read as follows.

Sample 1 - train.en

I gave him the book . 
He read the book . 
He loved the book .

Sample 2 - train.fr

Je lui ai donné le livre .
Il a lu le livre .
Il aimait le livre .
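
Before running anything, it can be worth confirming that the two files really are parallel. The following is a minimal Python sketch, not part of GIZA++: the function name `check_parallel` is made up for illustration, and treating blank lines as a problem is my own assumption rather than a documented GIZA++ requirement.

```python
def check_parallel(src_lines, tgt_lines):
    """Return a list of human-readable problems found in two sentence lists."""
    problems = []
    # Parallel files must have exactly one target line per source line.
    if len(src_lines) != len(tgt_lines):
        problems.append(
            f"line count mismatch: {len(src_lines)} vs {len(tgt_lines)}"
        )
    # Assumption: a blank line on either side signals a preprocessing slip.
    for i, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        if not src.strip() or not tgt.strip():
            problems.append(f"empty sentence on line {i}")
    return problems


# The sample sentences from train.en and train.fr above.
english = ["I gave him the book .", "He read the book .", "He loved the book ."]
french = ["Je lui ai donné le livre .", "Il a lu le livre .", "Il aimait le livre ."]

issues = check_parallel(english, french)
print("\n".join(issues) if issues else "files look parallel")
```

For real data, read each file with `open(path, encoding="utf-8").read().splitlines()` and pass the two lists in.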

Run these files through plain2snt.out to get target and source vocabulary files (*.vcb) as well as a sentence pair file (*.snt).

From the GIZA++ directory, run:

./plain2snt.out TEXT1 TEXT2

where TEXT1 and TEXT2 are the two data files described above.

This produces four files in the same directory as TEXT1 and TEXT2 (assuming they are in the same directory):

TEXT1_TEXT2.snt
TEXT1.vcb
TEXT2_TEXT1.snt
TEXT2.vcb

The vocab files contain a unique (integer) ID for each word in the text (NB: not tokenized/lemmatized), the word/string, and the number of times that string occurred. These are separated by a single space character.
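
To make the layout concrete, here is a small Python sketch that reads vocab lines in the "ID word count" format just described. The function name `parse_vcb` is hypothetical, and the sample lines (including the ID numbers) are made up for illustration; the numbering that plain2snt.out actually assigns may differ.

```python
def parse_vcb(text):
    """Parse .vcb content into a dict: word ID -> (word, count)."""
    vocab = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate a stray blank line
        # Each line holds three space-separated fields: ID, word, count.
        word_id, word, count = line.split()
        vocab[int(word_id)] = (word, int(count))
    return vocab


# Illustrative sample only; these IDs were invented for the example.
sample = "2 He 2\n3 the 3\n4 book 3\n5 . 3\n"
vocab = parse_vcb(sample)
print(vocab[4])
```

A mapping like this is handy for turning the numbers in a *.snt file back into words when checking the output by eye.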

The sentence files contain numbers. For each sentence pair, there are three lines: the first is a count of the number of times that sentence pair occurs in the corpus and the second and third are a string of (space-separated) numbers corresponding to the entries for words in the vocab files. Based on the naming convention for *.snt files, the first file is assumed to be the source, and the second is assumed to be the target language. For example, in the file TEXT1_TEXT2.snt, the first line will be a count of the number of times the first sentence-pair occurred in the corpus, the second line will be a string of numbers corresponding to words in the TEXT1.vcb file, and the third line will be a string of numbers corresponding to words in the TEXT2.vcb file.
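
The three-lines-per-pair layout can be read back with a few lines of Python. This is a sketch under the format described above, with a hypothetical function name `parse_snt` and invented sample numbers; it is not code from GIZA++ itself.

```python
def parse_snt(text):
    """Parse .snt content into a list of (count, source_ids, target_ids)."""
    lines = text.splitlines()
    if len(lines) % 3 != 0:
        raise ValueError("expected three lines per sentence pair")
    pairs = []
    for i in range(0, len(lines), 3):
        count = int(lines[i])  # how often this sentence pair occurs
        source_ids = [int(tok) for tok in lines[i + 1].split()]
        target_ids = [int(tok) for tok in lines[i + 2].split()]
        pairs.append((count, source_ids, target_ids))
    return pairs


# Invented sample: two sentence pairs, each occurring once.
sample = "1\n2 3 4\n2 3 4 5\n1\n5 6\n6 7\n"
for count, source_ids, target_ids in parse_snt(sample):
    print(count, source_ids, target_ids)
```

In TEXT1_TEXT2.snt the source IDs would refer to TEXT1.vcb and the target IDs to TEXT2.vcb; in TEXT2_TEXT1.snt the roles are swapped.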

Now TEXT1.vcb, TEXT2.vcb, and either of the two *.snt files can be used as input to GIZA++ to produce an alignment.

For example:

./GIZA++ -s TEXT1.vcb -t TEXT2.vcb -c TEXT1_TEXT2.snt

But note that when I tried to run this, I had to rename TEXT1_TEXT2.snt to something without an underscore in the name in order to get any proper output.
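
The rename workaround can be scripted. The sketch below, in Python, copies the sentence file to a name without underscores and builds the same command line shown above. The helper name `build_giza_command` and the default path `./GIZA++` are assumptions for illustration; only the `-s`, `-t`, and `-c` options come from the command above.

```python
import shutil
import tempfile
from pathlib import Path


def build_giza_command(src_vcb, tgt_vcb, snt_path, giza_bin="./GIZA++"):
    """Return the GIZA++ argument list, using an underscore-free .snt copy."""
    snt_path = Path(snt_path)
    if "_" in snt_path.name:
        # Copy rather than rename, so the original file is left in place.
        safe_path = snt_path.with_name(snt_path.name.replace("_", ""))
        shutil.copyfile(snt_path, safe_path)
        snt_path = safe_path
    return [giza_bin, "-s", str(src_vcb), "-t", str(tgt_vcb), "-c", str(snt_path)]


# Demonstration on a throwaway file; real use would point at actual data.
with tempfile.TemporaryDirectory() as tmp:
    snt = Path(tmp) / "TEXT1_TEXT2.snt"
    snt.write_text("1\n2 3\n2 3\n")
    cmd = build_giza_command("TEXT1.vcb", "TEXT2.vcb", snt)
    copy_exists = Path(cmd[-1]).exists()
    print(cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the GIZA++ directory to start the alignment.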

This PowerPoint tutorial worked for me: http://www.tc.umn.edu/~bthomson/wordalignment/GIZA.ppt

This one is very helpful: http://fabioticconi.wordpress.com/2011/01/17/how-to-do-a-word-alignment-with-giza-or-mgiza-from-parallel-corpus/

IIT-B scholars have put up nice, detailed presentations on setting up and using GIZA++ and Moses.

Some of them are: http://www.cse.iitb.ac.in/~pb/cs712-2013/potpouri/kashyap-giza-mozes-jan2013.pdf

http://www.cse.iitb.ac.in/~anoopk/publications/presentations/moses_giza_intro.pdf

http://www.cfilt.iitb.ac.in/Moses-Tutorial.pdf

Maybe this one?

http://code.google.com/p/giza-pp/issues/attachmentText?id=8&aid=697742396599277757&name=README-rst&token=40fba3d449abc12366b98b04cfe7dbc1

Full source: http://code.google.com/p/giza-pp/issues/detail?id=8

There is a supplemental explanation of how to format the input files and how to run GIZA++ here:

http://www.tc.umn.edu/~bthomson/wordalignment/GIZAREADME.txt
