Word count (2/3)
Overview
In the previous article, I introduced my learning project, in which I learned more about TDD and improved my Python skills. The whole project was inspired by: https://ccd-school.de/coding-dojo/#cd8.
I share one part per article, but you can easily see the whole project in my GitLab repository.
Part IV.
In this step, the application shows not only the number of words but also the number of unique words. Sample usage:
$ wordcount
Enter text: Humpty-Dumpty sat on a wall. Humpty-Dumpty had a great fall.
Number of words: 9, unique: 7
First, it is not clear how to count a word containing '-'. If Humpty-Dumpty counts as 2 words, the number of words is 9; otherwise it is 7. In this step I treat Humpty-Dumpty as 2 words.
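To make the two interpretations concrete, here is a minimal sketch that counts the sample sentence both ways. The stop word set and the two regular expressions are assumptions chosen for illustration only; they are not taken from the project.

```python
import re

# Hypothetical stop words; the project loads its real list from a text file.
STOP_WORDS = {"a", "on", "the", "off"}

def count_words(text, pattern):
    """Return (number of words, number of unique words) for the given token pattern."""
    words = [w for w in re.findall(pattern, text) if w.lower() not in STOP_WORDS]
    return len(words), len(set(words))

text = "Humpty-Dumpty sat on a wall. Humpty-Dumpty had a great fall."

# Letters only: the hyphen separates words, so Humpty-Dumpty is 2 words.
split_hyphen = count_words(text, r"[A-Za-z]+")
# Letters joined by hyphens: Humpty-Dumpty is 1 word.
keep_hyphen = count_words(text, r"[A-Za-z]+(?:-[A-Za-z]+)*")

print(split_hyphen)  # (9, 7)
print(keep_hyphen)   # (7, 6)
```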
def test_count_unique_word(self):
    # Old (Humpty-Dumpty=2 words) Number of words: 9, unique: 7
    # New (Humpty-Dumpty=1 word) Number of words: 7, unique: 6
    input_text = "Humpty-Dumpty sat on a wall. " \
                 "Humpty-Dumpty had a great fall."
    number_of_words, unique = simple_word_count(input_text)
    self.assertEqual(7, number_of_words)
    self.assertEqual(6, unique)
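import re
from collections import Counter

def simple_word_count(input_value_text):
    # Assumption: split on every run of non-letters, so that
    # "Humpty-Dumpty" yields two words in this step.
    lines = re.split(r"[^A-Za-z]+", input_value_text)
    count = 0
    selected_words = []
    for word in lines:
        if word.isalpha():  # also skips empty strings produced by the split
            if word not in stop_words:  # (1)
                count += 1
                selected_words.append(word)
    unique_count = len(Counter(selected_words).items())
    return count, unique_count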
(1) stop_words is the list of stop words, loaded from a text file.
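Loading the stop words could look roughly like the sketch below. The helper name load_stop_words, the one-word-per-line file format, and the sample words are assumptions; the project's actual loader may differ.

```python
import os
import tempfile

def load_stop_words(path):
    """Read one stop word per line, ignoring blank lines (assumed file format)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}

# Demo: a temporary file stands in for the real stop word file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("the\na\non\noff\n")

stop_words = load_stop_words(tmp.name)
os.remove(tmp.name)

print(sorted(stop_words))  # ['a', 'off', 'on', 'the']
```

A set is used here because the `word not in stop_words` check then runs in constant time.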
Part V.
In the next part, the word Humpty-Dumpty should be counted as 1 word. A regex can do the work for us: in this case '-' is treated as part of a word.
import re
from collections import Counter

# Letters and '-' form a word; '*' also yields empty matches, filtered out below.
WORD_PATTERN = "[a-z-A-Z]*"

def simple_word_count(input_value_text):
    lines = re.findall(WORD_PATTERN, input_value_text)
    count = 0
    selected_words = []
    for word in lines:
        # Keep tokens longer than one character; this drops the empty matches.
        if len(word) > 1 and re.match(WORD_PATTERN, word).endpos > 0:
            if word not in stop_words:  # (1)
                count += 1
                selected_words.append(word)
    unique_count = len(Counter(selected_words).items())
    return count, unique_count
The test was changed accordingly to expect fewer words (7 words, 6 unique).
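Because WORD_PATTERN ends in `*`, it also matches the empty string, so re.findall returns empty tokens between the real words. The small sketch below shows why the `len(word) > 1` filter is needed; the sample sentence is only an illustration. Note that the same filter also drops one-letter words such as "a" or "I".

```python
import re

WORD_PATTERN = "[a-z-A-Z]*"

raw = re.findall(WORD_PATTERN, "Humpty-Dumpty sat on a wall.")
# Drop empty matches and single-character tokens, as the article's filter does.
words = [token for token in raw if len(token) > 1]

print("" in raw)  # True
print(words)      # ['Humpty-Dumpty', 'sat', 'on', 'wall']
```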
Part VI.
This part adds one more statistic.
The average length of the counted words is calculated and printed, e.g.
$ wordcount sometext.txt
Number of words: 14, unique: 10; average word length: 5.63 characters
I will slightly improve the test method for this task. It should always redirect STDIN/STDOUT, run the test, and validate the result.
def runTest(self, given_answer, expected_out, args):
    with patch(BUILTINS_INPUT, return_value=given_answer), \
            patch(SYS_STDOUT, new=io.StringIO()) as dummy_out:
        my_count.main(args)
        self.assertEqual(dummy_out.getvalue().strip(), expected_out)

# should return zero, because space is not valid word
def test_empty_line_args_input(self):
    self.runTest(' ', 'Number of words: 0, unique: 0;'
                      ' average word length: 0.00 characters', [])

# test, word = 2 words, both unique, same length
def test_args_input(self):
    self.runTest('test a word', 'Number of words: 2, unique: 2;'
                                ' average word length: 4.00 characters',
                 [])
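The redirection technique used in the helper above can be tried in isolation. The sketch below assumes that BUILTINS_INPUT and SYS_STDOUT hold the strings "builtins.input" and "sys.stdout"; the function echo_input is a hypothetical stand-in for my_count.main.

```python
import io
from unittest.mock import patch

# Assumed values of the constants used in the test helper.
BUILTINS_INPUT = "builtins.input"
SYS_STDOUT = "sys.stdout"

def echo_input():
    """Hypothetical stand-in for my_count.main: read a line, print a result."""
    text = input("Enter text: ")
    print("Got: " + text)

def run_with_fake_io(func, given_answer):
    """Feed given_answer to input() and return everything func printed."""
    with patch(BUILTINS_INPUT, return_value=given_answer), \
            patch(SYS_STDOUT, new=io.StringIO()) as fake_out:
        func()
    return fake_out.getvalue().strip()

captured = run_with_fake_io(echo_input, "test a word")
print(captured)  # Got: test a word
```

Since input() is replaced by a mock, the prompt "Enter text: " is never written, so only the printed result ends up in the captured output.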
Only small changes are required to compute this statistic.
import re
from collections import Counter

WORD_PATTERN = "[a-z-A-Z]*"

def simple_word_count(input_value_text):
    lines = re.findall(WORD_PATTERN, input_value_text)
    count = 0
    avg_len = 0  # running total of the lengths of the counted words
    selected_words = []
    for word in lines:
        if len(word) > 1 and re.match(WORD_PATTERN, word).endpos > 0:
            if word not in stop_words:
                count += 1
                selected_words.append(word)
                avg_len += len(word)
    unique_count = len(Counter(selected_words).items())
    result_avg = 0
    if count != 0:  # avoid division by zero for empty input
        result_avg = avg_len / count
    return count, unique_count, result_avg
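The average is simply the total number of characters divided by the number of counted words, printed with two decimals. The sketch below shows that calculation and the output format on its own; the helper names average_word_length and format_stats are hypothetical and not part of the project.

```python
def average_word_length(words):
    """Total characters divided by word count; 0.0 for an empty list."""
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

def format_stats(words):
    """Build the output line in the format expected by the tests."""
    return ("Number of words: {}, unique: {};"
            " average word length: {:.2f} characters"
            .format(len(words), len(set(words)), average_word_length(words)))

print(format_stats(["test", "word"]))
# Number of words: 2, unique: 2; average word length: 4.00 characters
print(format_stats([]))
# Number of words: 0, unique: 0; average word length: 0.00 characters
```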
This concludes the second part of the article on word count. I share one part per article, but you can easily see the whole project in my GitLab repository.