Monday, June 8, 2020

Mucking about with n-grams


I grabbed some of these from Google. First off, we grab each file, zcat it, keep only the recent-ish (post-1960) data, keep only entries with actual letters in them, and drop duplicates.

for i in $(seq 1 399) ; do
  echo "$i"
  f="googlebooks-eng-gb-all-4gram-20090715-$i.csv.zip"
  wget "http://storage.googleapis.com/books/ngrams/books/$f"
  zcat "$f" | cut -f 1,2 | grep -P '\t(199|198|197|196|20)' | cut -f 1 | grep -P '[A-Za-z]' | awk '!x[$0]++' > "$i.txt"
  rm -f "$f"
done
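To see what the filtering stage does without downloading anything, here is a rough sketch on a few made-up rows. It assumes the files are tab-separated with the n-gram in column 1 and the year in column 2, which is what the cut and grep above imply. The function name filter_ngrams is hypothetical, and it swaps the cut/grep/awk chain for a single awk pass that should be equivalent, rather than reproducing the pipeline exactly.

```shell
# Hypothetical helper: single-pass awk version of the filter stage.
# Keeps column 1 when the year (column 2) is 1960 or later, the n-gram
# contains at least one letter, and it has not been printed before.
filter_ngrams() {
  awk -F'\t' '$2 >= 1960 && $1 ~ /[A-Za-z]/ && !seen[$1]++ { print $1 }'
}

# Made-up sample rows: ngram, year, then three count columns.
printf 'in the work group\t1959\t3\t3\t2\nin the work group\t1987\t5\t5\t4\nin the work group\t2001\t7\t7\t6\n1 2 3 4\t1990\t2\t2\t1\n' | filter_ngrams
```

The 1959 row is too old, the two later rows collapse into one, and the digits-only row is dropped, so a single line comes out.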


Then I used this ruleset, passphrase.rule, in conjunction with "normal" rules.

:
s *
s $
s %
s -
s _
s =
s %
s ,
s &
s "
s #
s @
s .
s ,
s /
s !
@
E
Es -
Es ,
Es _
Es =
Es %
Es ,
Es &
Es "
Es #
Es @
Es .
Es /
E@
Es !
c
u
C
@ c
@ C
@ u
s - e-
s . e.
s _ e_
s / e/
s , e,

Getting some fairly pleasing stuff like "Intheworkgroup1."
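As a rough illustration of how a candidate like that could come about, the sketch below imitates one rule from the list, "@ c" (purge the spaces, then capitalise), using awk, and then tacks "1." on the end. This is not how the cracking tool itself works. The helper name purge_and_capitalise is made up, the sample 4-gram is a guess, and the appended "1." merely stands in for whatever "normal" rule was involved (something like "$1 $." is an assumption).

```shell
# Hypothetical stand-in for the "@ c" rule: remove every space, then
# upper-case the first character and lower-case the rest.
purge_and_capitalise() {
  awk '{ gsub(/ /, ""); print toupper(substr($0, 1, 1)) tolower(substr($0, 2)) }'
}

# Assumed input 4-gram; sed appends "1." to mimic an append-style rule.
echo "in the work group" | purge_and_capitalise | sed 's/$/1./'
# prints: Intheworkgroup1.
```

If these are hashcat rules, which the syntax suggests, passing two -r options applies every rule in the first file combined with every rule in the second, which is one way to get the "in conjunction" behaviour.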
