15-110 Spring 2011 Homework 4
Due Sunday, 20-Feb, at 10pm
Same basic directions as hw3, so read those carefully (replacing "hw3" with "hw4", where appropriate, of course). Additional notes:
- Start by creating a folder named hw4. Place hw4-solo.py and hw4-collab.py in that folder. You will submit hw4.zip as usual, which will be a zipped version of the entire hw4 folder.
- This week's hw has a special emphasis on applications in Humanities and Business. Future hw's will focus on other disciplines. Our aim is to provide at least two weeks of special emphasis for each college at CMU.
- Reminder: portions marked SOLO must be solved entirely on your own (you may discuss with course staff, but not with each other or any outside sources). See syllabus for details.
- There are no restrictions on the Python constructs you may now use (conditionals, loops, lists, etc).
- SOLO Problem-Solving [35 points]
- SOLO median [10 pts]
Background:
Many computing tasks in the humanities require basic statistical
analyses. Here we will compute one of the most basic statistical
measures: the median.
Your task: write the function median (in the file hw4-solo.py). This
function takes a non-empty list of integers and returns the median of
the list, also as an integer, without
modifying the list (though you may modify a copy of the
list). The median of a sorted list of odd length is the
middle element. So the median of the list [3, 6, 20] is 6.
On the other hand, the median of a sorted list of even length
is the average of the middle two elements. So the median of
the list [3, 6, 20, 21] is the average of 6 and 20, or 13.
Note that the median of the list [2,3] is not 2.5, but instead is 2,
because we are using integer division. Here are some test
cases for you:
assert(median([2]) == 2)
assert(median([2,3]) == 2)
assert(median([2,3,42]) == 3)
assert(median([2,3,42,99]) == 22)
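Since you must not modify the input list, one workable approach is to sort a copy. Here is a minimal sketch along those lines (using Python's built-in sorted and integer division; the exact approach is up to you):

```python
def median(a):
    # Sort a copy so the original list is left unmodified.
    s = sorted(a)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                       # odd length: middle element
    return (s[n // 2 - 1] + s[n // 2]) // 2    # even length: integer average of middle two
```

Note that the integer division (//) is what makes median([2,3]) come out to 2 rather than 2.5, matching the behavior described above.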
- SOLO Automated Readability Index [25 pts]
Background: there are various algorithms to determine the
approximate grade-level of reading material. The "Automated
Readability Index", or ARI, is one such algorithm specifically designed
to be relatively easy to compute (see the
Wikipedia page for details). It takes a string of
text and returns, as a float, the corresponding grade level of the
text, where 13 is college-freshman-level. Specifically, the ARI is computed as:
    ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
Note #1: characters only include letters or digits (ignoring spaces, punctuation, and other symbols).
Note #2: when computing words, ignore all non-alphanumerics (that is, non-letters and non-digits) except spaces, question marks (?), exclamation points (!), and periods (.). Then, a "word" is a block of adjacent letters or digits.
Note #3: you can assume that sentences are always terminated by a question mark, exclamation point, or period. However, you should treat a run of several duplicate sentence terminators as just one. So, "This is one sentence!!!!!" is just 1 sentence.
Your task: write the function
automatedReadabilityIndex (in the file hw4-solo.py).
This function takes a string of text and returns the Automated Readability Index as described above.
To encourage you to use top-down design in this task, you must also write the following 3 helper functions:
charCount This function takes a string and returns the number of letters or digits in the string.
wordCount This function takes a string and returns the number of words in the string, as defined above.
sentenceCount This function takes a string and returns the number of sentences in the string, as defined above.
And to help you write these helper functions, you must also write the following helper function:
isLetterOrDigit:
This function takes a single character and returns True if it is
an uppercase or lowercase letter or a digit, and False otherwise.
Hopefully, you will see that by writing these helper functions, the
problem becomes easier. In any case, you will find test cases for
these functions in the starter code we
provided, though you are welcome to add more test cases, as always.
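To illustrate how the lowest-level helpers can fit together, here is one possible sketch of isLetterOrDigit and charCount (the range comparisons are just one approach; the remaining helpers are left to you):

```python
def isLetterOrDigit(c):
    # True exactly for 'a'-'z', 'A'-'Z', and '0'-'9'; False for anything else.
    return ("a" <= c <= "z") or ("A" <= c <= "Z") or ("0" <= c <= "9")

def charCount(text):
    # Count only the letters and digits, ignoring spaces, punctuation, etc.
    count = 0
    for c in text:
        if isLetterOrDigit(c):
            count += 1
    return count
```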
- Collaborative Problem-Solving (with Conditionals and Loops) [45 pts]
- Top Unigrams [20 pts]
Background:
first, skip ahead and do the reading on the Google ngram viewer (see part 2 below).
That tool counts how often various words or word phrases occur in
(a huge amount of) text. Here, we will do something similar,
though we will restrict ourselves to unigrams (that is, just one word,
and not word phrases). Also, in part to help keep this homework
manageable, but also to help you develop your skills in reasoning over
code that you did not write, we have solved the main part of this
exercise (the topUnigrams function, as described below). You just
need to write the helper functions here.
The main function for
this exercise is topUnigrams, and it takes an unsorted list of words
(which it may not modify) and a non-negative integer n, and returns a
list of the top n unigrams. Each unigram value in the result
should be a list of two elements -- the word and the count (the number
of times that word occurs in the text). The list of unigrams
should be sorted in decreasing order of the counts, with ties broken
by the word (using Python's default string comparison, so "Z" <
"a"). To help you understand the problem, here are some test cases:
wordList = ["dog", "cow", "cat", "cow", "cat", "cat", "dog", "dog", "dog", "cow"]
assert(topUnigrams(wordList, 0) == [])
assert(topUnigrams(wordList, 1) == [["dog", 4]])
assert(topUnigrams(wordList, 2) == [["dog", 4], ["cat", 3]])
assert(topUnigrams(wordList, 3) == [["dog", 4], ["cat", 3], ["cow", 3]])
assert(topUnigrams(["ack"], 2) == [["ack", 1]])
Once again, we have already written topUnigrams for you!
So your first step is to understand our code. Note that
there are somewhat more efficient ways to solve this, but our approach
is reasonably efficient while only using techniques we have already
covered in a straightforward manner. Once you understand this
function, you need to write its helper functions, as follows:
unigramOccursBefore [5 pts]: This
function takes two unigrams (that is, lists of length two where the
first value is a word and the second value is a count of how many times
that word occurs in some text) and returns True if the first unigram
should occur before the second unigram in the resulting list from
topUnigrams. This is true if the first unigram's count is higher,
or if the two have the same count and the first unigram's word is less
than the second unigram's word according to Python's default string
comparison (so "Z" < "a"). Here are some test cases for you
(study them carefully to be sure you understand the problem!):
assert(unigramOccursBefore(["dog",5],["cat",8]) == False)
assert(unigramOccursBefore(["cat",8],["dog",5]) == True)
assert(unigramOccursBefore(["cat",8],["cat",8]) == False)
assert(unigramOccursBefore(["cat",8],["cat",4]) == True)
assert(unigramOccursBefore(["cat",4],["cat",8]) == False)
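The test cases above can be satisfied by comparing the counts first and the words second; here is a sketch of that ordering logic (one possible approach, matching the asserts above):

```python
def unigramOccursBefore(u1, u2):
    # Higher count comes first in the result list.
    if u1[1] != u2[1]:
        return u1[1] > u2[1]
    # On a tie, the smaller word (by Python's default string order) comes first.
    return u1[0] < u2[0]
```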
insertUnigram [15 pts]:
This function takes three parameters -- a sorted list of
unigrams, a new unigram, and a number n (the maximum number of
unigrams) -- and destructively inserts the new unigram in the list as
appropriate (and returns nothing, seeing as this is a destructive
function). The function proceeds in two phases. In the first
phase, it adds the new unigram to the list. There are two cases
to consider here. If the list has fewer than n unigrams, just add
the new unigram to the end of the list. However, if the list
already has n unigrams, then replace the last unigram in the list with
the new unigram. Now the new unigram is in the list, but not
necessarily in the correct location (so the list is unsorted).
Thus, in the second phase, the function sorts the unigram list.
This cannot be done by calling list.sort()! Instead, do the
following: since we know the entire list except the last element
is already sorted, just keep swapping the new unigram with the unigram
to its left until you have swapped it into sorted order. This can
be done in one pass with a straightforward loop. Here are some test cases for you (and again, study them carefully to be sure you understand the problem!):
unigrams = []
insertUnigram(unigrams, ["cow", 5], 3)
assert(unigrams == [["cow", 5]])
insertUnigram(unigrams, ["dog", 3], 3)
assert(unigrams == [["cow", 5], ["dog",3]])
insertUnigram(unigrams, ["cat", 3], 3)
assert(unigrams == [["cow", 5], ["cat", 3], ["dog",3]])
insertUnigram(unigrams, ["yak", 9], 3)
assert(unigrams == [["yak", 9], ["cow", 5], ["cat", 3]])
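One way to realize the two phases described above is an append-or-replace step followed by a leftward swapping loop. A sketch, assuming the unigramOccursBefore comparison from the previous part (repeated here so the example is self-contained):

```python
def unigramOccursBefore(u1, u2):
    # Higher count first; ties broken by Python's default string comparison.
    if u1[1] != u2[1]:
        return u1[1] > u2[1]
    return u1[0] < u2[0]

def insertUnigram(unigrams, unigram, n):
    # Phase 1: add the new unigram at the end of the list, replacing
    # the last element if the list already holds n unigrams.
    if len(unigrams) < n:
        unigrams.append(unigram)
    else:
        unigrams[len(unigrams) - 1] = unigram
    # Phase 2: swap the new unigram leftward until it is in sorted order.
    i = len(unigrams) - 1
    while i > 0 and unigramOccursBefore(unigrams[i], unigrams[i - 1]):
        unigrams[i - 1], unigrams[i] = unigrams[i], unigrams[i - 1]
        i -= 1
```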
- Basic Economic Data Analysis [25 pts]
Background: many problems in economics involve analyzing tables
of data. Later in the course, we may do some more complex
analyses or data visualizations, but here we will limit ourselves to
some basic data manipulation and analysis. As for the data, we
will consider this table of wages by industry and by year. Here is an excerpt of the first several rows of the table:
Line | | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 |
1 | Wage and salary accruals | 4,180,916 | 4,465,176 | 4,827,698 | 4,952,202 | 4,997,306 | 5,154,598 | 5,410,691 | 5,705,982 | 6,070,143 | 6,415,473 | 6,554,044 | 6,279,120 |
2 | Domestic industries | 4,185,479 | 4,470,389 | 4,832,388 | 4,957,417 | 5,002,880 | 5,160,302 | 5,416,839 | 5,712,389 | 6,076,756 | 6,422,567 | 6,561,363 | 6,286,930 |
3 | Private industries | 3,484,241 | 3,736,700 | 4,052,677 | 4,135,533 | 4,129,767 | 4,246,977 | 4,464,031 | 4,720,900 | 5,041,605 | 5,333,540 | 5,417,383 | 5,113,359 |
4 | Agriculture, forestry, fishing, and hunting | 24,232 | 24,820 | 25,910 | 26,622 | 26,796 | 26,754 | 28,639 | 29,659 | 32,609 | 34,802 | 35,874 | 36,148 |
5 | Farms 1 | 15,883 | 16,169 | 16,974 | 17,926 | 17,919 | 17,586 | 19,109 | 19,597 | 20,005 | 22,155 | 22,930 | 23,840 |
6 | Forestry, fishing, and related activities | 8,349 | 8,651 | 8,937 | 8,697 | 8,876 | 9,168 | 9,530 | 10,062 | 12,604 | 12,647 | 12,944 | 12,308 |
7 | Mining | 29,233 | 28,583 | 29,602 | 32,010 | 30,718 | 31,260 | 34,856 | 40,228 | 48,049 | 53,800 | 62,315 | 55,006 |
8 | Oil and gas extraction | 10,151 | 10,115 | 10,850 | 11,502 | 11,306 | 11,505 | 13,078 | 14,548 | 17,407 | 19,046 | 23,128 | 21,701 |
9 | Mining, except oil and gas | 11,091 | 10,961 | 10,583 | 10,779 | 10,397 | 10,248 | 10,964 | 11,974 | 13,048 | 13,461 | 14,272 | 13,319 |
10 | Support activities for mining | 7,992 | 7,507 | 8,169 | 9,729 | 9,015 | 9,507 | 10,814 | 13,707 | 17,594 | 21,293 | 24,915 | 19,986 |
As you can see, the rows represent different industries, the columns
represent years, and the numeric values in the table are the salaries
(in millions of dollars) for each industry in each year. You may
notice that different industries are indented differently. This
is because the industry list is hierarchical,
with each summary row representing an industry group that includes the
sum of all the values of the rows indented below it. We will say
that a row contains a "single industry" if it is not a summary row
(that is, if the following row is not indented by more than the row in
question).
The first step in using a data source such as
this is to "condition" the data -- to convert it to an easily usable
form. In this case, we have already done this for you. The
file hw4-collab.py contains the function getWageData, which will return
the table above, but cleaned up for use in Python (and with the last 3
rows commented out, as they deal with non-domestic data, and for this
problem we are limiting our analysis to domestic data).
Then, write the following two functions:
makeSubtable [15 pts]: To support interesting queries over subsets of the
data, write the function makeSubtable. This function takes four
values -- a table formatted as the one above, the name of an industry
group, a start year, and an end year -- and returns a new table, also
formatted as the one above, but containing only the single industries
in the given industry group, and containing only the years in the given
range. You may assume the industry group is in the table, as are
both the start and end years. Here are some test cases for you (and again, study them carefully to be sure you understand the problem!):
table = getWageData()
subtable = [ ["line", "", 1999, 2000, 2001, 2002],
             [8, "Oil and gas extraction", 10115, 10850, 11502, 11306],
             [9, "Mining except oil and gas", 10961, 10583, 10779, 10397],
             [10, "Support activities for mining", 7507, 8169, 9729, 9015] ]
assert(makeSubtable(table, "Mining", 1999, 2002) == subtable)
subtable = [["line", "", 1998, 1999],
[89, "Civilian", 87876, 90276],
[90, "Military", 54048, 55524],
[91, "Government enterprises", 36948, 37716],
[94, "Education", 257912, 272332],
[95, "Other", 229376, 241003],
[96, "Government enterprises", 35078, 36838]]
assert(makeSubtable(table, "Government", 1998, 1999) == subtable)
Hint:
you should use top-down design with some well-chosen helper
functions. For example, in our sample solution, we found it
helpful to write a function that took a string and returned the number
of leading spaces in it. We also wrote a function that took a
table and the name of an industry group and returned the index of the
row of that industry group in that table. And we found it handy
to write a function that indicated whether or not a given row contained
a single industry or an industry group. We also wrote a few other
helper functions, but you get the idea. In any case, do not place
all your code in one monolithic function!
Another hint: if
s is a string, then s.strip() will return a new string with the leading
and trailing spaces removed. Handy!
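For instance, the leading-spaces helper mentioned above might look something like this (the name leadingSpaces is our own choice for illustration, not a required part of the assignment):

```python
def leadingSpaces(s):
    # Count the spaces before the first non-space character.
    count = 0
    for c in s:
        if c != " ":
            break
        count += 1
    return count
```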
highestPaidSingleIndustry [10 pts]: Write the function highestPaidSingleIndustry, which takes a table and finds the largest
salary in any year for any single industry in the table, and returns a
list of length 2 containing that single industry's name and the year. Here are some test cases for you (and once more, study them carefully to be sure you understand the problem!):
wageData = getWageData()
assert(highestPaidSingleIndustry(wageData) == ["Education", 2009])
subtable = [ ["line", "", 1999, 2000, 2001, 2002],
             [8, "Oil and gas extraction", 10115, 10850, 11502, 11306],
             [9, "Mining except oil and gas", 10961, 10583, 10779, 10397],
             [10, "Support activities for mining", 7507, 8169, 9729, 9015] ]
assert(highestPaidSingleIndustry(subtable) == ["Oil and gas extraction", 2001])
- Readings in Computational Thinking [20 pts]
- Google ngram Viewer [10 pts]
Read this Wall Street Journal article on Google's ngram viewer. Then, play around for a while with the actual Google ngram viewer.
Then, write a short paragraph explaining what the Google ngram
viewer is and how it functions, described so a lay person could
understand and appreciate it. Then, using the tool,
find some trends that you find interesting, and include a short
paragraph explaining the trends you found and why you find them to be
interesting. We would expect that everyone would submit unique
trends that they discovered on their own.
- Computing, Digital Media and the Egyptian Revolution [10 pts]
Write a short paragraph explaining the role of computing, and
especially digital media, in the recent (and even ongoing) Egyptian
Revolution (or, to be precise, the events leading to the resignation of
President Mubarak). Find, read, and cite at least two relevant
and generally respected sources to support this paragraph. Next,
write another short paragraph speculating on how computing trends might
influence politics and governance across the world in the next 20
years. We are looking for sound critical thinking and clear
explications here.
- Bonus/Optional: justifyText [3 pts]
Write the
function justifyText (in the file hw4-collab.py). This function
takes two parameters -- a string and a positive integer lineWidth --
and returns a new string, with embedded newlines, such that each line
except perhaps the last line is precisely lineWidth characters wide.
Do not hyphenate words or split words on hyphens. Instead,
include as many entire words as can fit on each line, and then include
extra spaces between words (spread out evenly across the line) so the
line exactly fits the lineWidth. Do not add extra spaces to the
last line, however. Do not worry about strange cases, such as
words that are longer than the lineWidth. Make your own test
function, too.
- Bonus/Optional: Psychology Experiment: Recalling concrete, abstract, and nonsense words [3 pts]
Here is a page of some interesting psychology experiments involving memory:
http://faculty.washington.edu/chudler/chmemory.html
If
you scroll down far enough, you'll find a section titled "Concrete
Words, Abstract Words and Just Plain Nonsense". Read that
section, then design a simple experiment using a Python program to
attempt to confirm the hypothesis that it is easier to recall concrete
words than abstract words, and easier to recall abstract words than
nonsense words. Then run the experiment. Then, in your
hw4.doc file, include a paragraph or two explaining your approach,
including the raw data of your experiment, and then explaining whether
the experiment confirmed or rejected the hypothesis.