Polish Information Processing Society
Annals of Computer Science and Information Systems, Volume 8

Proceedings of the 2016 Federated Conference on Computer Science and Information Systems

Exploration for Polish-* bi-lingual translation equivalents from comparable and quasi-comparable corpora.

DOI: http://dx.doi.org/10.15439/2016F304

Citation: Proceedings of the 2016 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 8, pages 517–525

Abstract. Translation has become a critical need in the contemporary world. Parallel dictionaries are now the resource most accessible to human users, but they have limits: because of neologisms and out-of-vocabulary words, they do not provide good-quality translation. Statistical translation systems are used to overcome this problem, and for them the quality and quantity of the training data are increasingly important. Such data, however, are available only for a few languages and for very narrow text domains. The purpose of this research is to improve current methods for mining comparable corpora by reducing computation time through GPU acceleration, introducing a tuning script, and re-implementing the underlying algorithms with the Needleman-Wunsch algorithm. Experiments were conducted on multilingual data extracted from Wikipedia across numerous domains, and multiple cross-lingual comparisons were made on the Wikipedia data. These changes had a positive impact on both the quantity and the quality of the mined data. The solution is language independent and highly practical, especially for under-resourced languages.
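
For readers unfamiliar with the Needleman-Wunsch algorithm named in the abstract, the following is a minimal Python sketch of global sequence alignment by dynamic programming. It is not the implementation used in the paper: the function name `needleman_wunsch` and the fixed `match`, `mismatch`, and `gap` scores are assumptions chosen for illustration, and the abstract does not state how sentence pairs are scored or how the GPU version is organised.

```python
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def needleman_wunsch(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 1,       # assumed score for equal items
    mismatch: int = -1,   # assumed score for unequal items
    gap: int = -1,        # assumed penalty for skipping an item
) -> Tuple[int, List[Pair]]:
    """Globally align sequences a and b; return (score, aligned pairs)."""
    n, m = len(a), len(b)

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two items
                score[i - 1][j] + gap,            # a[i-1] left unpaired
                score[i][j - 1] + gap,            # b[j-1] left unpaired
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned.append((a[i - 1], None))
            i -= 1
        else:
            aligned.append((None, b[j - 1]))
            j -= 1
    aligned.reverse()
    return score[n][m], aligned


if __name__ == "__main__":
    total, pairs = needleman_wunsch("abc", "ac")
    print(total)  # 1
    print(pairs)  # [('a', 'a'), ('b', None), ('c', 'c')]
```

The same function accepts lists of words or sentence identifiers instead of characters. In a corpus-mining setting the equality test would presumably be replaced by a sentence similarity score, which is outside the scope of this sketch.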

