Analytics Vidhya

Analytics Vidhya is a community of Generative AI and Data Science professionals. We are building…

Search automation in Google Translate using basic Python


Link to the Portuguese translation of this article here

INTRODUCTION

Lately, when I needed to write Python code to translate texts into a foreign language, I turned to the Googletrans library, which is quite easy to use and relies on the Google Translate API with no restrictions. However, this library has had a serious bug for the last few months (one can read more about this subject at this link), which makes it impossible to use at the moment.

Although there are other Python packages that meet this same need, I decided to use this small setback as a learning opportunity. Since Googletrans is not working properly, would it be possible to write a small web scraping program to automate the search for translations directly on the Google Translate website? I was glad to discover that the answer is yes, and that this can be done using only the open function from the built-in webbrowser module.

Therefore, this article presents the final results of this web scraping work of mine, which was inspired by the mapIt.py program presented in chapter 12 of the great book Automate The Boring Stuff With Python (link to the complete text here). Its author, Al Sweigart, proposes a remarkably simple solution to automate address searches on Google Maps. Since I had already read that book and practiced its examples (and given my previous familiarity with the Googletrans library), it was relatively easy to work out the Google Translate URL pattern and write my own code.

HOW TO TRANSLATE A TEXT USING THE WEBBROWSER LIBRARY

We will access the Google Translate website (https://translate.google.com/) to make a short translation from English to Portuguese. Next, we will analyze the information provided by the webpage, especially the new URL, presented after the site translates our text:

Here, a (not so) important warning: the comma in the “Hello, World!” statement was intentionally left out for didactic purposes (although there is some discussion about whether it is required, as shown here).

As one can see in the screenshot above, the new URL is https://translate.google.com/?sl=en&tl=pt&text=Hello%20World!&op=translate. Four parameters are appended to it, separated by three ampersand symbols:

· sl=en

&

· tl=pt

&

· text=Hello%20World!

&

· op=translate

Also, note the question mark before the first parameter.

It is highly likely that sl and tl stand for source language and target language, respectively. The value en (for the sl parameter) is the code for English, while pt is the code for Portuguese. The text parameter holds the text to be translated. Finally, op (which probably means “operation”) has translate as its value.
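To double-check this reading of the URL, one can take it apart with the standard urllib.parse module. This is only a small illustrative sketch: the URL is the one analyzed above, and the variable names are arbitrary.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://translate.google.com/?sl=en&tl=pt&text=Hello%20World!&op=translate"

# Everything after the question mark is the query string.
parts = urlsplit(url)

# parse_qs splits the query on "&" and "=", and decodes sequences such as %20.
params = parse_qs(parts.query)

for name, values in params.items():
    print(f"{name} = {values[0]}")
```

Each parameter comes back as a list because a query string may repeat the same name more than once.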

Another aspect to consider in the URL structure is the blank space between the words “Hello” and “World!”. As one can notice, it was replaced by the sequence %20 (this is known as percent-encoding, or URL encoding). The blank space is not the only character that is encoded in a special way when used in the Google Translate URL: the comma, for example, is written as %2C, so a comma followed by a blank space becomes %2C%20. You can try other special characters in your own translations to check whether they are encoded or not.
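Python's standard library can reproduce this encoding, which is a convenient way to look up the code for any character. The sketch below uses urllib.parse.quote; passing safe="!" is my own choice so that the exclamation mark stays as it appears in the URL above (by default, quote would also encode it, as %21).

```python
from urllib.parse import quote, unquote

# Comma and blank space are encoded; "!" is kept because it is listed as safe.
print(quote("Hello, World!", safe="!"))   # Hello%2C%20World!

# A line break is encoded as %0A.
print(quote("line one\nline two"))        # line%20one%0Aline%20two

# unquote reverses the process.
print(unquote("Hello%2C%20World!"))       # Hello, World!
```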

We are now ready to analyze the initial code below:

As mentioned before, the simplicity of the code is evident: it consists of a single f-string representing the final URL, with the parameter values interpolated in their correct places. Next, this URL is opened by the open function from the webbrowser module.
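A minimal sketch of the idea described above: one f-string for the URL, then webbrowser.open. The function names are illustrative assumptions, and the actual listing may differ in its details.

```python
import webbrowser


def build_translate_url(sl, tl, text):
    """Interpolate the three values into the Google Translate URL pattern."""
    return f"https://translate.google.com/?sl={sl}&tl={tl}&text={text}&op=translate"


def open_translation(sl, tl, text):
    """Open the translation page in the default browser."""
    # The text is used as-is here, with no encoding of special characters.
    return webbrowser.open(build_translate_url(sl, tl, text))
```

Calling open_translation("en", "pt", "Hello World!") should open a browser tab with the translation already requested.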

However, when we run the code as presented, some small problems arise, most notably that the line structure of the verses is not reproduced in the translated text on the Google Translate website. This occurs because line break characters (the famous \n) are not automatically encoded when we use the source text directly in the URL without any treatment. Only blank spaces are converted in this situation.

To work around this problem, I created a small custom function that replaces certain characters with their corresponding codes in the URL. The complete code, with the addition of this function, is reproduced below:
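One possible shape for such a replacement function is sketched here. The name encode_for_url and the choice of characters in the table are assumptions made for illustration; the table can be extended with any other character that causes trouble.

```python
# Characters that need a special code in the URL, with their percent-encoded form.
URL_CODES = {
    "%": "%25",
    " ": "%20",
    "\n": "%0A",
    ",": "%2C",
    "?": "%3F",
    "&": "%26",
    "#": "%23",
    "+": "%2B",
    "=": "%3D",
}


def encode_for_url(text):
    """Replace each listed character with its code, leaving the others untouched."""
    # Working character by character avoids re-encoding the "%" of a code
    # that was just inserted.
    return "".join(URL_CODES.get(char, char) for char in text)


print(encode_for_url("Two households,\nboth alike"))
```

The standard library function urllib.parse.quote does the same job for every character that needs it, so it can replace a hand-written table when full coverage matters more than seeing the mapping explicitly.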

Generalizing the code with input from a .txt file

Now, let us turn the main body of our code into a function named open_google_trans, which will receive the sl, tl and text parameters. We will also add the if __name__ == "__main__" block (the “dunder name, dunder main” idiom) at the end of the file:
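The refactoring described above might look like the sketch below. The helper google_trans_url, the use of urllib.parse.quote in place of the custom replacement function, and the command-line handling are all assumptions made to keep the example short and self-contained.

```python
import sys
import webbrowser
from urllib.parse import quote


def google_trans_url(sl, tl, text):
    """Build the Google Translate URL, encoding the text for safe use in it."""
    return f"https://translate.google.com/?sl={sl}&tl={tl}&text={quote(text)}&op=translate"


def open_google_trans(sl, tl, text):
    """Open a browser tab with the translation of text from sl to tl."""
    return webbrowser.open(google_trans_url(sl, tl, text))


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when it is imported.
    if len(sys.argv) == 4:
        open_google_trans(sys.argv[1], sys.argv[2], sys.argv[3])
    else:
        print('usage: python translate.py SL TL "TEXT"')
```

With this layout, other programs can import open_google_trans without triggering a translation, while running the file directly with three arguments opens one.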

The last modification we will make to our code, for now, is a feature that allows us to load a .txt file with the text to be translated. To do that, we will use the textToTranslate.txt file, available in my repository on GitHub. Here, the idea is to translate the original prologue of the play Romeo and Juliet into seven languages: Portuguese (pt), Spanish (es), Esperanto (eo), Latin (la), Turkish (tr), Korean (ko), and Japanese (ja). For each of these languages, a separate tab will be opened in the browser, with the respective translation.

For other language codes, you may access the site https://cloud.google.com/translate/docs/languages

You can download the .txt file to your machine, or you can adapt the code with a few lines using the Requests library, which will download the file directly from my repository on GitHub. I will add this adaptation to the code (Requests is easily installed via pip install requests). I have also modified the code to translate the .txt file into the aforementioned seven languages, with a 5-second pause between each new translation. The final version of the program is presented below.
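A sketch of how these pieces could fit together is shown here. The function names, the opener and sleeper parameters (added so the loop can be tried without opening a browser), and the use of urllib.parse.quote are assumptions; the address of the raw file is not filled in, so download_text expects whatever URL you pass to it.

```python
import time
import webbrowser
from urllib.parse import quote

TARGET_LANGUAGES = ["pt", "es", "eo", "la", "tr", "ko", "ja"]


def download_text(url):
    """Fetch a text file over HTTP (requires: pip install requests)."""
    import requests  # imported here so the rest of the sketch runs without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def read_text(path):
    """Read a local text file, such as a downloaded textToTranslate.txt."""
    with open(path, encoding="utf-8") as file:
        return file.read()


def translate_to_many(text, sl="en", targets=TARGET_LANGUAGES, pause=5,
                      opener=webbrowser.open, sleeper=time.sleep):
    """Open one browser tab per target language, pausing between tabs."""
    urls = []
    for index, tl in enumerate(targets):
        if index > 0:
            sleeper(pause)  # wait before opening the next tab
        url = f"https://translate.google.com/?sl={sl}&tl={tl}&text={quote(text)}&op=translate"
        opener(url)
        urls.append(url)
    return urls
```

For example, translate_to_many(read_text("textToTranslate.txt")) would open seven tabs, five seconds apart, assuming the file is in the current directory.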

An important observation about a limitation of the code presented here: Google Translate limits the source text to 5,000 characters. Thus, to translate larger texts, one should adapt the code for this specific purpose.
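One way to make that adaptation is to split the source text into pieces that fit the limit and open one translation per piece. The function below is only a sketch of that idea: split_text is a hypothetical helper, and it breaks the text on line boundaries so that verses or paragraphs are not cut in the middle whenever that can be avoided.

```python
def split_text(text, limit=5000):
    """Split text into chunks of at most `limit` characters, preferring line breaks."""
    chunks = []
    current = ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit has to be cut into fixed-size pieces.
        while len(line) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:limit])
            line = line[limit:]
        # Start a new chunk when this line would not fit in the current one.
        if len(current) + len(line) > limit:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```

Joining the chunks back together gives the original text, so nothing is lost in the split. Note that the limit here counts the characters of the source text, not the length of the encoded URL.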

Finally, I would like to make a comment that goes beyond the code. The translations provided by Google Translate should always be reviewed by the user, if she/he has the skills to do so. Indeed, translation quality varies a lot from language to language. Concerning Latin, for example, translations usually present serious and even basic problems, to such a point that, in my opinion, the Google service should be avoided for that language. As a former Latin professor at a Federal University in Brazil, I have seen more than one gullible student get into trouble for trying to hand in, as homework, a disjointed translation copied directly from Google Translate. As a matter of fact, ten years have passed since these events, and Google’s Latin translations still have a lot of room for improvement…

In my next article, I’m going to explore how one can save the translation generated by Google Translate using the Selenium library.

Thank you so much for having honored my text with your reading.

Happy coding!

P.S. (23/mar/2021): I found out today that googletrans has released a temporary fix for the error AttributeError: ‘NoneType’ object has no attribute ‘group’. That is great news! I tested it here and it worked just fine. In order to use it, you need to uninstall the current googletrans version and install the specific version with the temporary fix, as shown below:

You can find more information about that in this GitHub issue:

https://github.com/ssut/py-googletrans/issues/280


Published in Analytics Vidhya



Written by Fabrício Barbacena

Python and Django Developer • Data Analyst • BI Consultant • Data Science • Data Engineering • https://linktr.ee/fabriciobarbacena
