Brojanje reči

Elem imamo tekst:
Free Bible Download
Uzeti text verziju.
Zadatak je prebrojati koliko puta se ponavlja koja rec nakon sto se text
pretvori u lowercase i ocisti od interpunkcije.
Prikazati reverzno sortirano po frekvenciji prvih 20 reci.
Mala zanimacija, interesantno je videti koliko moze biti efikasno
i elegantno u isto vreme. Svi jezici dozvoljeni pa i shell ;0)
Za pocetak evo elegantne ekstremno neefikasne implentacije
u Haskell-u:

~/.../d/bible >>> cat bbl.hs   
import Data.List (sort,group)
import Control.Arrow ((&&&))
import Data.Char (toLower,isPunctuation)
import Text.Printf (printf)

main = do
  contents <- readFile "bible.txt"
  let result = reverse.sort $
                        map (length &&& head) $
                        group.sort.words $
                        map toLower $ filter (not.isPunctuation) contents
  mapM_ (\(x,y) -> printf "%8d %s\n" x y) $ take 20 result

i rezultat:

  63924 the
  51695 and
  34734 of
  13561 to
  12913 that
  12667 in
  10420 he
  9838 shall
  8997 unto
  8971 for
  8854 i
  8473 his
  8177 a
  7830 lord
  7376 they
  7013 be
  6989 is
  6659 him
  6596 not
  6429 them

Jezici mogu da se razlikuju po tome sta spada u interpunkcijse znake ili ne:)

Gg6GGSa.jpg

:smiley:

Koliko traje izvršavanje/brojanje reči?

./bbl 3.79s user 0.12s system 99% cpu 3.938 total

sledi efikasnija verzija od ove …

Sa obzirom da je String u Haskell-u linked lista operacije sa njim su cisto u edukativne svrhe :stuck_out_tongue:
sledi verzija sa ByteStringom, tj nizom karaktera:

import Data.List (sort,group)
import Control.Arrow ((&&&))
import Data.Char (toLower,isPunctuation)
import Text.Printf (printf)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8

main = do
  contents <- B.readFile "bible.txt"
  let result = reverse.sort $
                      map (length &&& head) $
                      group.sort.B8.words $
                      B8.map toLower $ B8.filter (not.isPunctuation) contents
  mapM_ (\(x,y) -> printf "%8d %s\n" x (B8.unpack y)) $ take 20 result

ovo traje duplo krace ;p

Običan one-liner awk, sa sve sortiranjem :slight_smile:

awk '{for(i=1;i<=NF;i++) a[$i]++} END {for(k in a) print k,a[k]}' bible.txt | sort -k 2 -n

Vreme izvršavanja: 1.24 sekunde

Disclaimer:
NISAM programer :slight_smile:
Screenshot - 12152018 - 10:37:19 PM.png

I na kraju verzija sa hashtabelom gde elegancija gubi na svojoj vrednosti :wink:

[CODE]{-# Language BangPatterns #-}
import qualified Data.HashTable.IO as H
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import Data.List
import Text.Printf
import Data.Char
import Data.IORef

type HashTable k v = H.BasicHashTable k v

wordFreq :: [B.ByteString] -> IO [(B.ByteString, Int)]
wordFreq xs = do
ht <- H.new :: IO (HashTable B.ByteString (IORef Int))
mapM_ (updatefreq ht) xs
lst <- H.toList ht
mapM ((x,y)-> do
v <- readIORef y
return (x,v)) lst
where updatefreq ht word = do
!lu <- H.lookup ht word
case lu of
Nothing -> do
ref <- newIORef 1
H.insert ht word ref
Just x -> modifyIORef’ x (+1)
return ()

main = do
contents <- B.readFile “bible.txt”
result <- wordFreq.B8.words $
B8.map toLower $ B8.filter (not.isPunct) contents
let sorted = reverse.sort $ map ((x,y) -> (y,x)) $ result
mapM_ ((x,y) -> printf “%8d %s\n” x (B8.unpack y)) $ take 20 sorted

isPunct c =
c == ‘’’ || c == ‘.’ || c == ‘;’ || c == ‘(’ || c == ‘)’
|| c == ‘"’ || c == ‘?’ || c == ‘-’ || c == ‘_’ || c == ‘!’
|| c == ‘,’ || c == ‘:’ || c == ‘|’

[/CODE]

  63924 the
  51696 and
  34734 of
  13561 to
  12913 that
  12667 in
  10420 he
  9838 shall
  8997 unto
  8971 for
  8854 i
  8473 his
  8177 a
  7830 lord
  7376 they
  7013 be
  6989 is
  6659 him
  6596 not
  6430 them
./bbl4  0.19s user 0.01s system 97% cpu 0.212 total

Gle to je samo 25 puta brze ;p

A evo i moje skriptice koju koristim da dobijem vreme izvršavanja :slight_smile:

[CODE]#!/bin/bash
start=$(date +%s.%N);
sleep 0.1s;
./counting.sh; \

time=$(echo “$(date +%s.%N) - $start” | bc);
printf “Vreme izvršavanja : %.2f sekundi\n” $time[/CODE]

awk sam smestio u posebnu skripticu, counting.sh

Nije dobar rezultat, moras da pretvoris u lowercase, izbacis interpunkciju i sortiras reverzno :wink:

Zašto?

Da bi izjednacio reci The i the znaci The i the je ista rec 2 puta :wink:

Ok…boja je kolor, a Boja postoji i kao žensko ime…prvo što mi je palo na pamet :slight_smile:

Osim toga, prošlo je vreme MSDOS-a, fajl sistemi razlikuju mala i velika slova :slight_smile:

Ne ulazimo sad u semantiku, to bi zakomplikovalo stvari ;p

@Branimir Maksimovic
Ama, semantika je realnost, zato i pucaju programi koji je zanemaruju :smiley:

Kad smo pominjali Go, evo isto to u Go-u :stuck_out_tongue:

[CODE]package main

import (
“fmt”
“unicode”
“io/ioutil”
“strings”
“sort”
)

func main() {
buf, err := ioutil.ReadFile(“bible.txt”)
if err != nil {
fmt.Println(err)
return
}
content := make(map[string]int)
filtered := make([]byte,0,len(buf))
for _,v := range buf {
if !IsPunct(v) {
v = byte(unicode.ToLower(rune(v)))
filtered = append(filtered,v)
}
}
fmt.Printf("%d\n%d\n",len(buf),len(filtered))
words := strings.Fields(string(filtered))
for _,v := range words {
content[v]++
}
output := make(PairList,0,len(content))
for k,v := range content {
output = append(output,Pair{v,k})
}
sort.Sort(output)
for i := 0;i<20;i++ {
fmt.Printf("%8d %s\n",output[i].count,output[i].word)
}
}

func IsPunct(c byte)bool {
return c == ‘’’ || c == ‘.’ || c == ‘;’ || c == ‘(’ ||
c == ‘)’ || c == ‘"’ || c == ‘?’ || c == ‘-’ || c == ‘_’ || c == ‘!’ ||
c == ‘,’ || c == ‘:’ || c == ‘|’
}

type Pair struct {
count int
word string
}

type PairList []Pair

func (this PairList) Len()int { return len(this) }
func (this PairList) Swap(i,j int) { this[i],this[j] = this[j],this[i] }
func (this PairList) Less(i,j int)bool { return this[i].count > this[j].count }
[/CODE]

~/…/d/bible >>> time ./bbl
4587478
4430654
63924 the
51696 and
34734 of
13561 to
12913 that
12667 in
10420 he
9838 shall
8997 unto
8971 for
8854 i
8473 his
8177 a
7830 lord
7376 they
7013 be
6989 is
6659 him
6596 not
6430 them
./bbl 0.12s user 0.04s system 135% cpu 0.122 total

[/code]

Deluje da je Go efikasniji :wink:

Nego…instalirao sam haskell, ali ne mogu da pokrenem program koji si okačio…gde grešim?
Ispljune greške, napravi i neke fajlove, ali nix…
Screenshot - 12152018 - 11:00:31 PM.png

Ma to kompajliras sa ghc-om :wink:
Ovako:

~/.../d/bible >>> ghc -O2 bbl.hs  
[1 of 1] Compiling Main  ( bbl.hs, bbl.o )
Linking bbl ...

I onda startujes executable. Taj primer sa hashtable treba da instaliras jos i
cabal update
pa
cabal install hashtables
i
cabal install bytestring

Uf…kakav sam magarac, da previdim da mora da se kompajlira…tnx :slight_smile:

Ne ide kompajliranje tvoje sors skripte kako treba…koju verziju ghc i haskella koristiš?