Language detection module

Dec 18th, 2009 17:00
new acct, Shopping Snooper, justin hamp, Adam John, joh kong, engatoo engatoo, osososo didididi, Nathan Wallace, Hans Nowak, Snippet 192, Dinu C. Gherman


"""
Packages: text;miscellaneous.mundane
"""

"""
Is there already a function to which I can pass an
arbitrary string and which will tell me whether it is written in
English, French, German, etc.?

I imagine this could be implemented rather simply with some
dicts containing common prefixes and suffixes (as well as the
most frequently used words like 'you', 'me', etc.) of each
natural language. One could then calculate a likelihood for
the text to be in each of these languages and return
the most likely one, or a list of them. I'm not sure, though,
how accents would be represented in a "portable" way (across
multiple platforms), maybe in HTML...?

While writing this I started a small experiment to code up what
I have in mind. Below you'll find what came out of it. Is there
anything more sophisticated out there, including a better
scoring/weighting method, maybe also for combinations of words,
perhaps even handling accents?
"""
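
On the accent question raised above: one portable option, sketched here as an aside and not part of the module that follows, is to treat the text as Unicode and fold accented letters to their base letters before matching. The sketch assumes Python 3, where every str is Unicode; the helper name fold_accents is hypothetical.

```python
import unicodedata


def fold_accents(text):
    """Return text with combining accents removed, e.g. 'é' becomes 'e'."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop the combining marks and keep every other character.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents('Überraschung, déjà vu'))
```

Letters without a Unicode decomposition, such as 'ß' or 'ø', pass through unchanged. Folding also throws away a useful clue, since accents themselves hint at the language, so an alternative is to keep them and only normalize both the text and the dict keys to one form (for example NFC).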

# langdetect.py -- Detect the natural language of a written text.

import string

en, fr, de = 'en', 'fr', 'de'

wordDict = {
    'i':en, 'you':en, 'me':en, 'the':en, 'a':en, 
    'moi':fr, 'je':fr, 'toi':fr, 'vous':fr, 'sur':fr, 'en':fr,
    'sie':de, 'ich':de, 'um':de, 'an':de, 'ab':de}

prefixDict = {
    'off':en, 'to':en, 'under':en, 'in':en, 'thou':en,
    'mont':fr, 'contr':fr, 'mal':fr,
    'ver':de, 'zu':de, 'los':de, 'gut':de}

suffixDict = {
    'son':en, 'day':en, 'ing':en, 'ly':en, 'ght':en,
    'ique':fr, 'tude':fr, 'ont':fr, 'nal':fr,
    'tung':de, 'heim':de, 'zeug':de}

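punct = """.,!?"()[]{}§$%&/*+#"""


def detectLanguage(text):
    # Lowercase, turn punctuation into spaces, then split on whitespace.
    # Plain str.replace is used because string.maketrans is not
    # available in current Python 3.
    inp0 = text.lower()
    inp1 = inp0
    for ch in punct:
        inp1 = inp1.replace(ch, ' ')
    inp3 = inp1.split()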

    res = {en:0, fr:0, de:0}
    explain = {en:[], fr:[], de:[]}

    for word in inp3:
        try:
            v = wordDict[word]
            res[v] += 1
            explain[v].append(word)
        except KeyError:
            pass

        # Credit the language of every known prefix the word starts with.
        for p in prefixDict:
            if word.startswith(p):
                v = prefixDict[p]
                res[v] += 1
                explain[v].append(word)

        # Credit the language of every known suffix the word ends with.
        for s in suffixDict:
            if word.endswith(s):
                v = suffixDict[s]
                res[v] += 1
                explain[v].append(word)

    return res, explain


for phrase in ("I am in a good mood today.", 
        "Je suis en plaine forme.",
        "Ich bin heute gut drauf."):
    result, explain = detectLanguage(phrase)
    # Single-argument print calls behave the same on Python 2 and 3.
    print("Input: %s" % phrase)
    print("Hypothesis: %s" % (result,))
    print("Reasons: %s" % (explain,))
    print("")


# Should print something like this:
#
# Input: I am in a good mood today.
# Hypothesis: {'en': 5, 'fr': 0, 'de': 0}
# Reasons: {'en': ['i', 'in', 'a', 'today', 'today'], 
#           'fr': [], 
#           'de': []}
#
# Input: Je suis en plaine forme.
# Hypothesis: {'en': 0, 'fr': 2, 'de': 0}
# Reasons: {'en': [], 'fr': ['je', 'en'], 'de': []}
#
# Input: Ich bin heute gut drauf.
# Hypothesis: {'en': 0, 'fr': 0, 'de': 2}
# Reasons: {'en': [], 'fr': [], 'de': ['ich', 'gut']}
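
The question asks for a function that returns "the most likely one, or a list of them". A minimal sketch of that last step follows, assuming Python 3 and a dict of raw hit counts shaped like the first value returned by detectLanguage. The name rank_languages is hypothetical, and dividing by the total number of hits is just one simple way to normalize, not a proper statistical likelihood.

```python
def rank_languages(counts):
    """Turn raw hit counts into (language, share) pairs, best first."""
    total = sum(counts.values())
    if total == 0:
        return []  # no evidence for any language
    # Each language's share of all hits; the shares sum to 1.
    shares = [(lang, hits / total) for lang, hits in counts.items()]
    # Highest share first; languages with equal shares keep the dict's order.
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


ranking = rank_languages({'en': 3, 'fr': 1, 'de': 0})
print(ranking)  # [('en', 0.75), ('fr', 0.25), ('de', 0.0)]
```

The first pair is the best guess, and an empty list signals that no known word, prefix, or suffix matched. Weighting whole-word hits more heavily than prefix or suffix hits would be a natural refinement, but it would require detectLanguage to report the kind of each hit.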