Statistics on Numbers and Punctuation

Re: RE: How Realistic is the Dom Perignon Test -- John Harms
Posted by Jean Ichbiah, Wed, Nov 27, 2002, 23:10:32

Numbers and punctuation marks are surprisingly rare. The Brown Corpus, the corpus most often used in linguistic analysis (search for it on Google), shows commas at 0.98% and periods at 0.83%. The Brown Corpus is a bit old now, and comma usage in particular appears to have decreased in the past twenty years, to the point where periods outnumber commas in each of the collections below.

Below are statistics from four different large text collections:

  • Bill Machrone: a set of columns he published in PC Magazine.
  • MRoth Email File: a set of email messages collected over a year. (They are unedited and contain a lot of computer-generated headers, hence more numbers than in text written by mortals.)
  • Model Business Act: legal and administrative language.
  • Dictal: a large set of medical transcription reports.

Here are the results:

Bill Machrone
Characters: 205,537
Spaces: 15.0942%
Commas: 0.8423%
Periods: 0.8736%
Letter E: 8.9721%

MRoth Email File
Characters: 1,003,586
Spaces: 15.5468%
Commas: 0.7523%
Periods: 1.1526%
Colons: 0.7877%
Digits: 1.3948%
Letter E: 8.2004%

Model Business Act
Characters: 387,969
Spaces: 14.0976%
Commas: 0.6150%
Periods: 0.7258%
Digits: 0.2462%
Letter E: 9.5633%

Medical Transcription Dictal
Characters: 2,226,323
Spaces: 24.2561%
Commas: 0.2456%
Periods: 1.2581%
Colons: 0.2801%
Digits: 0.5222%
Letter E: 9.1752%
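
For anyone who wants to run the same kind of count on their own text, here is a minimal Python sketch. It is not the tool used for the figures above. The function name `char_stats` is made up for illustration, and it assumes that percentages are taken over all characters, that "Letter E" counts both upper and lower case, and that "Digits" means the ASCII digits 0 to 9. The original counts may have used different conventions.

```python
from collections import Counter


def char_stats(text):
    """Return the character total and the share (in percent) of a few character classes.

    Assumptions (not confirmed by the post): percentages are relative to the
    total number of characters, 'Letter E' is case-insensitive, and 'Digits'
    covers only the ASCII digits 0-9.
    """
    total = len(text)
    if total == 0:
        return {"Characters": 0}

    counts = Counter(text)

    def pct(n):
        # Share of all characters, in percent.
        return 100.0 * n / total

    return {
        "Characters": total,
        "Spaces": pct(counts[" "]),
        "Commas": pct(counts[","]),
        "Periods": pct(counts["."]),
        "Colons": pct(counts[":"]),
        "Digits": pct(sum(counts[d] for d in "0123456789")),
        "Letter E": pct(counts["e"] + counts["E"]),
    }


if __name__ == "__main__":
    sample = "Numbers and punctuation marks are surprisingly rare."
    for name, value in char_stats(sample).items():
        if name == "Characters":
            print(f"{name}: {value:,}")
        else:
            print(f"{name}: {value:.4f}%")
```

To use it on a file, read the file into a string and pass that string to `char_stats`.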

They do confirm Gordon Walker's analysis, namely that punctuation and digits are so rare that they are within the margin of error of what you can expect in a text of 182 characters.
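
To put a rough number on that, here is a small Python sketch. It is my own back-of-the-envelope model, not necessarily the calculation Gordon Walker made: it treats each of the 182 characters as an independent draw (a binomial model, which real text only approximates) and uses the rates from the tables above. The function name `expected_count` is hypothetical.

```python
import math


def expected_count(rate_percent, n_chars=182):
    """Expected number of occurrences and standard deviation in n_chars characters.

    Assumes a simple binomial model: every character is an independent draw
    with probability rate_percent / 100 of being the character of interest.
    """
    p = rate_percent / 100.0
    mean = n_chars * p
    std = math.sqrt(n_chars * p * (1.0 - p))
    return mean, std


if __name__ == "__main__":
    # Rates taken from the Bill Machrone and MRoth Email File figures above.
    for label, rate in [("Commas (Machrone)", 0.8423),
                        ("Periods (Machrone)", 0.8736),
                        ("Digits (MRoth)", 1.3948)]:
        mean, std = expected_count(rate)
        print(f"{label}: expect {mean:.2f} +/- {std:.2f} in 182 characters")
```

With the Machrone comma rate this gives roughly 1.5 commas, give or take about 1.2, so one comma more or fewer in a 182-character text is well within the noise.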

Jean Ichbiah
