The Latest American English Word Frequency List

2010-04-10 13:04

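(Added April 13: over the past two days I checked this word list's coverage against some articles from the web and a GMAT document. Its 20,000 words covered almost everything; only one or two very unusual words were missing. I estimate that the word families of the first 5,000 words contain more than 10,000 words, so anyone who can use them fluently already has quite good English.)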
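A coverage check like this is easy to repeat. The sketch below is only an illustration, not the procedure used for the check above: the function name `coverage`, the sample sentence, and the tiny `sample_list` are made up for the example. It matches exact word forms only, so with a lemma list you would also need a lemmatizer to recognize inflected forms such as breaking or broke.

```python
import re

def coverage(text, wordlist):
    """Return (share of tokens found in wordlist, sorted list of missing words)."""
    known = {w.lower() for w in wordlist}
    # Lowercase the text and keep alphabetic tokens (with an optional apostrophe part).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0, []
    covered = sum(1 for t in tokens if t in known)
    missing = sorted({t for t in tokens if t not in known})
    return covered / len(tokens), missing

# Hypothetical mini word list; a real check would load the downloaded list instead.
sample_list = ["the", "law", "is", "news", "heart", "can", "break"]
ratio, missing = coverage("The law is the news whilst the heart can break", sample_list)
print(ratio, missing)
```

Words that end up in `missing` are the ones to look at by hand: they are either rare, misspelled, or inflected forms that the list stores under a different lemma.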

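I am preparing for an MBA program that starts in August, so lately I have been searching the web for word lists to strengthen my vocabulary. GMAT and GRE word lists contain many obscure words that appear only in specialist writing and are rarely used in everyday study and life, so studying them is not very efficient. I then found a 6,138-word frequency list that is popular online, but gave up before finishing it: it is derived from British English, and its spellings are old-fashioned. It even includes words like whilst, which in modern American usage surely falls outside the top 20,000 words. That shows how dated the list is. Persistence paid off, and I finally found an up-to-date word list derived from COCA.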

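COCA (Corpus of Contemporary American English) is the largest American English corpus project of this century, comparable in standing to the influential British National Corpus (BNC). Most of the English word frequency lists in use today come from the BNC. In other words, they reflect British English frequencies, drawn from texts of the 1980s and early 1990s.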

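COCA is still ongoing and has so far collected about 400 million words of text. The source material covers the most widely read fiction and magazines of the twenty years from 1990 to 2009 (TIME, The New Yorker, and others are among the sources), films and TV programs, a large number of transcripts of telephone and face-to-face conversations, and even the 9/11 report. The material is classified and counted statistically by period, type of text, and so on, which amounts to compiling a new dictionary of American English usage complete with word frequencies and current usage.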

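Building on COCA's results so far, Brigham Young University has used computational methods to extract the 20,000 most frequent words in American English, together with their collocates.
The list of the top 5,000 words can already be downloaded:
go to http://www.wordfrequency.info/?freeList=y
and click "download the list" at the bottom of the page.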

Sample pages from the 5,000-word and 20,000-word e-books (the samples include roughly 5,000 words) can also be downloaded for free; see http://www.wordfrequency.info/files/entries.pdf

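The best thing about this word list is that each word comes not only with its frequency and synonyms but also with its collocates. A word's collocates are the set of words most closely associated with it and most often used near it. With them, we can see the few dozen most common usages and contexts of the word among American speakers. Take break: the top four neighboring words in its collocate list are law, heart, news, and rule, so we can guess that its most frequent usages are break the law, break someone's heart, breaking news, and break the rules. This does far more to build a feel for the language than the example sentences in a dictionary.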
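To make the idea concrete, here is a minimal sketch of how collocates can be counted: look at every occurrence of a word and tally the words that appear within a few positions of it. This is an illustration only. The function name `collocates`, the window size, the stopword set, and the sample text are all assumptions of mine, and the sketch uses raw co-occurrence counts; the post does not say how the published lists rank their collocates.

```python
from collections import Counter
import re

# Hypothetical stopword set for the demo; a real one would be much larger.
STOPWORDS = {"the", "a", "my", "do", "not", "you", "will", "they", "again", "never"}

def collocates(text, node, window=2, top=5):
    """Count content words within `window` tokens of each occurrence of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i and tokens[j] not in STOPWORDS:
                counts[tokens[j]] += 1
    return counts.most_common(top)

sample = ("Do not break the law. You will break my heart. "
          "They break the law again. Never break a rule.")
result = collocates(sample, "break")
print(result)
```

Note that the sketch ignores sentence boundaries, so a word at the end of one sentence can be counted as a neighbor of break in the next. On a corpus of hundreds of millions of words this kind of noise matters far less than it does in a four-sentence sample.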

Below is the site's own English description of its features; you can also read it directly at http://www.wordfrequency.info.

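Also, if you help promote the site by posting about it once on a large forum for English learners (a single post is enough) and then email them the link, you can get the e-book of the 5,000-word frequency list with collocates for free. The print edition of the book is also available on Amazon.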


So far, this is the best word list I have seen.


COMPARE (to data from the British National Corpus / American National Corpus)
There are many English word lists and frequency lists out on the Web. Some are good, some are very bad. Not all frequency lists are created equal.
One should be very, very suspicious of word lists that are taken from small samples of web data, outdated texts, or corpora that are too small to effectively model what is happening in the real world. Or worse, word lists that don't give you any idea what they are based on. As the saying goes: "garbage in (bad texts), garbage out (frequency lists)". 
Rather than focusing too much on a comparison with specific wordlists that are out there on the Web, here are some questions you might ask yourself as you consider downloading or purchasing a word list:
Depth and accuracy. Why do so many wordlists on the web contain just the top 1000-3000 words of English? Why not the top 10,000 or 20,000? It's because even a bad corpus (the collection of texts that the word lists are based on) can produce a moderately accurate list for the very most frequent words. But because the corpus is neither deep nor balanced enough, you start getting messy data for medium and lower frequency words. Ask to see samples of the top 10,000 or 20,000 words (e.g. every 7th or 10th word). If they don't have it, then you should be very, very suspicious of that word list.
Genres. Does the corpus contain texts from a wide variety of genres -- spoken, fiction, popular magazines, newspapers, and academic journals? Frequency lists that are based on just one of these may only contain 40-50% of the words from a more balanced corpus. Our frequency list is based on the Corpus of Contemporary American English (COCA), which is almost perfectly balanced across genres.
Size. COCA contains more than 400 million words, and each of the top 20,000 words occurs at least 300 times. In a small 10-20 million word corpus, some of these words would occur just 7-8 times. At that point, the lower frequency words might make it into the list "by chance", whereas others are left out. No such problem with COCA.
How recent is it? Language change happens. If the word list is based on 15-20 year-old texts (or much worse, 100 year old public domain novels), then it will be missing many of the words from the modern language. COCA is based on texts from 1990-2009 (20 million words each year)-- or in other words, virtually right up to the current time.
Is it just a bare wordlist? Word lists are nice, but to be really useful (especially for language learning) there ought to be some indication of what these words mean and how they are used. Most of our frequency lists contain the top 20-30 collocates (nearby words) for each word in the list, which creates a great "sketch" of each word.
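The arithmetic behind the "Size" point can be checked with a quick sketch. It assumes a word keeps the same relative frequency in a smaller corpus, which is a simplification; the function name `expected_count` is mine, and the corpus sizes are the round figures quoted above.

```python
def expected_count(count_in_big, big_size, small_size):
    """Scale an observed count to a smaller corpus at the same relative frequency."""
    return count_in_big * small_size / big_size

COCA_SIZE = 400_000_000  # words, as quoted in the text
MIN_COUNT = 300          # minimum occurrences of a top-20,000 word in COCA

at_10m = expected_count(MIN_COUNT, COCA_SIZE, 10_000_000)
at_20m = expected_count(MIN_COUNT, COCA_SIZE, 20_000_000)
print(at_10m, at_20m)
```

A word that occurs 300 times in 400 million words would be expected about 7.5 times in a 10-million-word corpus, in line with the "7-8 times" quoted above, and about 15 times in a 20-million-word one. Counts that small swing widely from sample to sample, which is why low-frequency words enter or drop out of small-corpus lists more or less by chance.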
--------------------------------------------------------------------------------
Summary. There are many word frequency lists out on the web. Some are just OK, and some are truly bad. The frequency lists that we have created are the only ones that are based on a large, recent, and balanced corpus of English, and which provide indications of the meaning and use of each word.



Word frequency lists and dictionary 
from the Corpus of Contemporary American English



This site contains what we believe is the most accurate frequency data of English, and it comes in a number of different formats (see the table below).

Any frequency list is only as good as the corpus (collection of texts) that it is based on. Our data is based on the only large, genre-balanced, up-to-date corpus of American English -- the 450 million word Corpus of Contemporary American English. You can be sure that the data that you find here represents what you would encounter in the real world.

If you are a language learner, you can use the frequency lists to maximize your study of vocabulary in a way that is not possible with any other resource. If you are a (computational) linguist, you will have access to highly accurate, robust and useful data for research and for Natural Language Processing.

The English frequency data comes in a number of different formats, shown below. You can also get frequency data for Spanish and Portuguese or Academic English.

Basic word lists: Top 5,000-60,000 words (lemmas).

Genre frequency: See the frequency of each of the top 60,000 lemmas -- in spoken, fiction, popular magazine, newspapers, and academic, as well as more than 40 sub-genres like NEWS-Financial or ACAD-Medicine. You can then use this data to create your own customized lists for particular genres and sub-genres.

Collocates: Collocates = "nearby words", and they provide great insight into the meaning and use of words -- more than any other lists. See (a maximum of) 200-300 collocates for each of the 60,000 words, giving nearly 4,800,000 node word / collocate pairs.

N-grams: Up to 155 million unique 2-5 grams (2-5 word sequences), with frequencies for each string. Allows you to search for the patterns in which a word occurs.

eBook: The 20,000 most frequent words (lemmas) in American English, along with the 20-30 most frequent collocates and the synonyms for each word.

Printed book: (From Routledge). The top 5,000 words (including collocates) and thematic lists.

Free word list: Basic list of the top 5,000 lemmas.
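For readers unfamiliar with the N-grams entry above, the following sketch shows what a table of 2-5 grams with frequencies contains. It is a toy illustration under my own assumptions (the function name `ngram_counts`, the simple tokenizer, the sample sentence), not the site's actual data format or extraction method.

```python
from collections import Counter
import re

def ngram_counts(text, n_min=2, n_max=5):
    """Count every run of n_min..n_max consecutive words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("break the law and break the rules")
# List the patterns that start with "break", most frequent first.
patterns = [(gram, c) for gram, c in counts.most_common() if gram.startswith("break ")]
print(patterns)
```

Filtering the counts by a starting word, as in the last lines, is the "search for the patterns in which a word occurs" that the entry describes.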
