"Theorizing from Data: Avoiding the Capital Mistake
Peter Norvig
"It is a capital mistake to theorize before one has data." Sir Arthur Conan Doyle's words from 1891 remain true today. Researchers in computational linguistics and information retrieval now have a million times more data than was available 30 years ago. This talk explores what this data can do for problems in language understanding, translation, information extraction, and inference, and extrapolates to what more data may bring in the future.
Channel: News & Politics Uploaded: June 5, 2007 at 1:32 pm Author: GoogleDeveloperDay
It still amazes me, over and over again, how smart some people can be. I'm getting my Professional Bachelor of Informatics in 2 months and I feel really dumb compared to these people. But then again, they have their years of experience and I only have my 3 years at college. I find this topic very interesting, though a little bit hard to understand at times.
He mentioned a DVD that Google sold which had their collection of English words. Anyone know how to obtain it?
Any help will be VERY appreciated^^
pixiemotion(Friday 23rd of November 2007 01:27:17 AM)
Very interesting overview, but the question session at the end revealed a rather low level of competence among the audience, which is too bad -- there are some much more interesting theoretical questions to be asked. For one, this type of machine translation seems to be founded on having some sort of parallel aligned texts. This is relatively easy for German and English, as shown in the examples, since they're very similar languages both syntactically and lexically.
pixiemotion(Friday 23rd of November 2007 01:28:34 AM)
But what happens when you try aligning, e.g., polysynthetic languages such as Greenlandic (where a single word may express what in English would be a ten-word sentence) and analytic languages such as Chinese (where the average word length is, what, 2.5 letters?). There are a lot of challenges to be met, and it'd be very interesting to see how Norvig and the Google MT team are dealing with them.
pixiemotion(Friday 23rd of November 2007 01:30:12 AM)
The basic method of combining a probabilistic translation model with a language model is relatively old news (Brown et al., 1990), and the same criticisms that applied 17 years ago have not been answered here: what do you do with language pairs that differ substantially in structure?
Now, if they manage to translate English-Klingon, that'd be impressive.
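For anyone following along who hasn't seen the setup mentioned above: the usual way to combine a translation model and a language model is to pick the English candidate e that maximizes P(f | e) * P(e) for a foreign phrase f. Here is a minimal sketch of that idea in Python. The probability tables, the German phrase, and the function names `score` and `translate` are all made up for illustration; this is not how Google's system or the Brown et al. system is actually implemented, and a real system would learn these numbers from aligned text.

```python
import math

# Hypothetical toy numbers, invented purely for illustration.
# Translation model P(f | e): how well the foreign phrase matches an English candidate.
TRANSLATION_MODEL = {
    ("das haus", "the house"): 0.6,
    ("das haus", "house the"): 0.6,  # same words, so the translation model cannot tell them apart
    ("das haus", "the home"): 0.3,
}

# Language model P(e): how plausible the English candidate is on its own.
LANGUAGE_MODEL = {
    "the house": 0.05,
    "house the": 0.0001,
    "the home": 0.02,
}


def score(foreign, english):
    """Return log P(f | e) + log P(e), or -inf if either probability is unknown."""
    p_tm = TRANSLATION_MODEL.get((foreign, english), 0.0)
    p_lm = LANGUAGE_MODEL.get(english, 0.0)
    if p_tm == 0.0 or p_lm == 0.0:
        return float("-inf")
    return math.log(p_tm) + math.log(p_lm)


def translate(foreign, candidates):
    """Pick the candidate with the highest combined score."""
    return max(candidates, key=lambda english: score(foreign, english))


if __name__ == "__main__":
    print(translate("das haus", list(LANGUAGE_MODEL)))
```

In this toy table the translation model scores "the house" and "house the" the same, and it is the language model that prefers the fluent word order. The hard part raised in this thread, getting usable alignments for very different language pairs, is exactly what the toy tables gloss over.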
Erudecorp(Saturday 22nd of December 2007 04:41:34 PM)
You would put that into the search criteria, and have it search words within words (synthetic) or context (analytic). English is a mix of synthetic and analytic already, so you can see it already has those capabilities.