Language detection module

Apr 2nd, 2009 06:50
engatoo engatoo, osososo didididi, Nathan Wallace, Hans Nowak, Snippet 192, Dinu C. Gherman

Packages: text;miscellaneous.mundane
Is there already a function to which I can pass an arbitrary
string and which will tell me whether it is written in English,
French, German, etc.?
I imagine this could be implemented rather simply with some
dicts containing common prefixes and suffixes (as well as the
most often used words like 'you', 'me', etc.) of the respective
natural language. One could then calculate some likelihood for
the text to be in any of these languages and return the most
likely one, or a list of them. I'm not sure, though, how accents
would be represented in a "portable" way (across multiple
platforms), maybe in HTML...?
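On the accents question, one portable option is to keep the text as Unicode and normalize it with the standard unicodedata module before looking words up. This is only a sketch, assuming Python 3 (where every str is Unicode); the helper names normalize_accents and strip_accents are made up for illustration.

```python
import unicodedata

def normalize_accents(text):
    # NFC composes base letter + combining mark into one code point, so
    # 'e' + U+0301 and the precomposed U+00E9 compare equal afterwards.
    return unicodedata.normalize('NFC', text)

def strip_accents(text):
    # Hypothetical helper: decompose (NFD), then drop the combining marks.
    # This loses information, so it is only useful for fuzzy matching.
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

composed = '\u00e9t\u00e9'        # French "ete" with precomposed accents
decomposed = 'e\u0301te\u0301'    # same word with combining accents
print(composed == decomposed)                                        # False
print(normalize_accents(composed) == normalize_accents(decomposed))  # True
print(strip_accents(composed))                                       # ete
```

Whether to strip accents at all is a design choice: they are themselves a strong hint for French or German, so normalizing while keeping them is probably the safer default.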
While writing this I started a small experiment to code up what
I have in mind. Below you'll find what came out of it. Is there
anything more sophisticated out there, including a better
scoring/weighting method, maybe also for combinations of words,
perhaps even handling accents?
# -- Detect the natural language of a written text.
en, fr, de = 'en', 'fr', 'de'
# Very common whole words, mapped to their language.
wordDict = {
    'i':en, 'you':en, 'me':en, 'the':en, 'a':en,
    'moi':fr, 'je':fr, 'toi':fr, 'vous':fr, 'sur':fr, 'en':fr,
    'sie':de, 'ich':de, 'um':de, 'an':de, 'ab':de}
# Typical word beginnings.
prefixDict = {
    'off':en, 'to':en, 'under':en, 'in':en, 'thou':en,
    'mont':fr, 'contr':fr, 'mal':fr,
    'ver':de, 'zu':de, 'los':de, 'gut':de}
# Typical word endings.
suffixDict = {
    'son':en, 'day':en, 'ing':en, 'ly':en, 'ght':en,
    'ique':fr, 'tude':fr, 'ont':fr, 'nal':fr,
    'tung':de, 'heim':de, 'zeug':de}
# Punctuation is replaced by blanks before splitting into words.
punct = """.,!?"()[]{}!§$%&/*+#"""
trans = str.maketrans(punct, ' '*len(punct))
def detectLanguage(text):
    # Lower-case, blank out punctuation, split on whitespace.
    words = text.lower().translate(trans).split()
    res = {en:0, fr:0, de:0}
    explain = {en:[], fr:[], de:[]}
    for word in words:
        if word in wordDict:
            # A known whole word counts once for its language.
            v = wordDict[word]
            res[v] = res[v] + 1
            explain[v].append(word)
            continue
        # Otherwise every matching prefix and suffix counts once.
        for p, v in prefixDict.items():
            if word.startswith(p):
                res[v] = res[v] + 1
                explain[v].append(word)
        for s, v in suffixDict.items():
            if word.endswith(s):
                res[v] = res[v] + 1
                explain[v].append(word)
    return res, explain
for phrase in ("I am in a good mood today.",
        "Je suis en plaine forme.",
        "Ich bin heute gut drauf."):
    result, explain = detectLanguage(phrase)
    print("Input:", phrase)
    print("Hypothesis:", result)
    print("Reasons:", explain)
# Should print something like this:
# Input: I am in a good mood today.
# Hypothesis: {'en': 5, 'fr': 0, 'de': 0}
# Reasons: {'en': ['i', 'in', 'a', 'today', 'today'], 
#           'fr': [], 
#           'de': []}
# Input: Je suis en plaine forme.
# Hypothesis: {'en': 0, 'fr': 2, 'de': 0}
# Reasons: {'en': [], 'fr': ['je', 'en'], 'de': []}
# Input: Ich bin heute gut drauf.
# Hypothesis: {'en': 0, 'fr': 0, 'de': 2}
# Reasons: {'en': [], 'fr': [], 'de': ['ich', 'gut']}
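The raw hit counts shown above are not yet the "likelihood" or the "list of most likely languages" asked for at the top. A minimal sketch of that last step follows, assuming Python 3 and a dict of counts shaped like the Hypothesis output; rank_languages is a hypothetical name, and the relative share of hits is just one simple way to score, not a proper probability.

```python
def rank_languages(scores):
    # scores: language code -> hit count, e.g. {'en': 5, 'fr': 0, 'de': 0}
    total = sum(scores.values())
    if total == 0:
        return []   # no evidence for any language
    # Sort best first; sorted() is stable, so ties keep the dict's order.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Express each count as its share of all hits.
    return [(lang, count / total) for lang, count in ranked]

print(rank_languages({'en': 5, 'fr': 0, 'de': 0}))
# [('en', 1.0), ('fr', 0.0), ('de', 0.0)]
print(rank_languages({'en': 1, 'fr': 3, 'de': 0}))
# [('fr', 0.75), ('en', 0.25), ('de', 0.0)]
print(rank_languages({'en': 0, 'fr': 0, 'de': 0}))
# []
```

The first element of the returned list is the most likely language; an empty list means none of the words, prefixes or suffixes matched.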