Hallway Mathlete

Saturday, February 1, 2020

Hand-Drawn Graphics On African American Life In 1900

Traditionally this blog has been original content, but these graphs were too cool not to share. The following graphs were hand drawn by W. E. B. Du Bois right before the turn of the last century. Du Bois was the first African American to earn a doctorate from Harvard University, in 1895, where he studied history and sociology.

The graphs were originally displayed in an exhibit organized by Du Bois at the Exposition Universelle in Paris, titled "Exhibit of American Negroes". The full collection of graphs from the exhibit has been digitized by the Library of Congress and can be found here.

Income and Expenditure of 150 Negro Families in Atlanta, Ga., USA.

Proportion of Freemen and Slaves Among American Negroes

Race Amalgamation in Georgia

Conjugal condition of American Negroes according to age periods.

City and Rural Population (Du Bois really went wild with this one)

Negro Teachers in Georgia Public Schools

American Negro newspaper and periodicals

Valuation of Town and City Property owned by Georgia Negroes

Assessed Value of Household and Kitchen Furniture Owned By Georgia Negroes

Negro Property in Two Cities of Georgia

Source: Library of Congress

Monday, April 3, 2017

Rick and Morty by the numbers

Rick and Morty is quickly becoming a cult classic. With the last episode having aired on October 4, 2015, fans have been starving for new episodes. On April Fools' Day, the creators of the show surprised fans with a new episode from the long-awaited third season.

After seeing the new episode, I decided to do an analysis of Rick and Morty. This analysis is heavily influenced by Todd Schneider's blog post analyzing The Simpsons (I highly recommend you check it out).

Rick and Morty Ratings

First, I wrote code to gather data from IMDb. The average rating of every episode is above 8, except for season 2, episode 8: "Interdimensional Cable 2: Tempting Fate". In this episode, Jerry is faced with keeping his penis or having it surgically removed to save the life of an alien leader. The other main characters sit in the lobby of the hospital and watch interdimensional cable. The shows on cable are short clips and are extremely random. Fans accustomed to the great banter between Rick and Morty get almost none in this episode.

The family dynamic

Then I gathered and parsed scripts from RickAndMorty.wikia.com. Unfortunately, only about half the scripts are online. I took the average number of words each member of the family spoke per episode. I left out Mr. Poopybutthole and Sleepy Gary because they are more "like family" than family.

Rick is the center of the show and has by far the most lines. Next is Jerry, because Morty has almost no lines in the episode "The Wedding Squanchers". Since I only have a portion of the scripts, this one episode lowers Morty's average to below Jerry's. When more transcripts are uploaded to the Wikia, I will rerun my analysis and update the post.
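For readers curious how a per-episode average like this could be computed, here is a minimal Python sketch. It is not the code behind the chart (that was written in R): the `Speaker: dialogue` transcript format, the `FAMILY` set, and both function names are assumptions made for illustration, and the two "episodes" are made-up text rather than real scripts.

```python
from collections import defaultdict

# Assumed list of family members; the real analysis may have used other labels.
FAMILY = {"Rick", "Morty", "Summer", "Beth", "Jerry"}

def words_per_speaker(transcript):
    """Count the words spoken by each family member in one transcript."""
    counts = defaultdict(int)
    for line in transcript.splitlines():
        # Assumes every spoken line looks like "Speaker: dialogue".
        speaker, sep, speech = line.partition(":")
        speaker = speaker.strip()
        if sep and speaker in FAMILY:
            counts[speaker] += len(speech.split())
    return dict(counts)

def average_words(transcripts):
    """Average words per episode for each family member."""
    totals = defaultdict(int)
    for transcript in transcripts:
        for speaker, n in words_per_speaker(transcript).items():
            totals[speaker] += n
    # A character who is silent in an episode still counts that episode.
    return {name: totals[name] / len(transcripts) for name in FAMILY}

# Two made-up episodes; Morty is absent from the second one.
episodes = [
    "Rick: We have to go Morty\nMorty: Oh geez Rick\nJerry: Hello",
    "Rick: Science is the answer\nJerry: I found a job today",
]
averages = average_words(episodes)
# Rick averages 4.5 words, Jerry 3.0, Morty 1.5
```

With so few episodes in the sample, one episode where Morty barely speaks is enough to pull his average below Jerry's, which is the same effect described above.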

Recurring Characters:

Justin Roiland and Dan Harmon have a special ability to write unique characters that audiences love. Below is a chart of recurring characters from the show. This data is from the credits. Sometimes characters are not in the episode, but are still in the credits. When this happens, the character is represented with a gray box.

After waiting two years, I can't wait to watch more of season 3, and I hope Justin and Dan choose to include more of Snuffles. I think his story could be developed into a very satisfying plot for people of all ages.

Note:

The graphs were made using ggplot2 and the data was gathered using rvest. (God bless Hadley Wickham)

Tuesday, November 15, 2016

College Majors with Fewer Women tend to have Larger Pay Gaps

The gender pay gap has become a key talking point in politics in the past few years. It is commonly quoted that women make only 77¢ for every $1 earned by their male counterparts. This statistic is calculated by taking the average salary of women and comparing it to the average salary of men. It does not control for job title, degree, regional difference, company size, or hours worked. After controlling for these factors, women's pay rises from 77% to 98% of men's, so the gap shrinks from 23¢ to 2¢ on the dollar. (Payscale.com does a fantastic analysis and breakdown of the gender pay gap. I highly recommend you check it out.)

Using the salaries from the Payscale report that controlled for these factors, I decided to do further analysis on the pay gap. I wanted to know how the percentage of women in a field affected the pay difference between men and women. Since many companies try to hire equal numbers of men and women for similar roles, and there are significantly fewer women in science and engineering, I assumed these few women would be able to demand a higher salary because of the scarcity. However, I found the opposite.

The interactive graphic below shows that college majors with fewer women tend to have larger differences in pay. (The graphics may be difficult to see on mobile. Please switch to a desktop or use the mobile-friendly version.)

Two of the largest outliers are Nursing and Accounting. Nursing is 92.3% women, yet men still make about $2,400 more on average. Accounting is 52.1% women, with a pay difference of $2,400 in favor of men. One point of interest is elementary education, where women make 1.4% more than men.

The following graph shows the same results as above, but instead of absolute difference in dollars, the commonly used metric of women's salaries as a percentage of men's salaries is plotted.
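The two graphs use two different measures of the same gap, and a small Python sketch may make the distinction concrete. The function name and the salaries are hypothetical, not figures from the Payscale data.

```python
def pay_gap(men_salary, women_salary):
    """Return (dollar difference, women's pay as a percentage of men's)."""
    difference = men_salary - women_salary
    percent_of_men = 100 * women_salary / men_salary
    return difference, percent_of_men

# Hypothetical salaries for one major, chosen only to illustrate the math.
diff, pct = pay_gap(50_000, 48_000)
# diff is 2000 dollars; pct is 96.0 percent
```

The same dollar difference produces a different percentage depending on how high salaries are in that major, which is why the two graphs can rank majors slightly differently.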

Any guess as to why this occurs would be largely speculative, but I imagine a "boys' club" mentality could be to blame, where men like to hire men and are even willing to pay more to do so.

The links to the data sources can be found below and as always, add any suggestions in the comments.

Data Source:

[1] http://www.payscale.com/career-news/2009/12/do-men-or-women-choose-majors-to-maximize-income
[2] https://docs.google.com/spreadsheets/d/1FDrXUk4t-RQekuKotqMD7pyGFmylHip-xarOawcVvqk/edit#gid=0

Monday, October 31, 2016

A Look at Lynching in the United States

With the current racial tension within the United States and the Halloween season, I have seen many posts on Facebook showing black mannequins and other human-like figures hanging from trees. These images spark debate on whether these are directly racist acts or just distasteful mistakes. The comment sections are filled with pseudo-lynching-experts defending both sides.

So I decided to do an analysis of lynching within the United States. I found data on lynching from the Tuskegee Institute [1-2]. This data set contains information on lynchings from 1882-1962. I tried to find lynching statistics from before 1882 and found that, during slavery, lynching Africans was fairly uncommon because slave owners had a vested interest in keeping their slaves alive.

The first bit of information I found shocking was that whites made up 27% of lynching victims. However, the percentage of whites vs. blacks changes wildly from state to state. To understand this, the percentage of black victims was plotted by state. As might be expected, the deep south had the highest percentage of black victims, followed by the rest of the south.

However, a look at the total number of lynchings showed that states in the deep south did not offend evenly. Mississippi, Georgia, and Texas had the most lynchings (581, 531, and 493 respectively), and then there was a large drop-off to Louisiana and Alabama at 391 and 347. New Hampshire had zero lynchings, and Delaware, Maine, and Vermont each had one. New York and New Jersey each had only two.

Over time, the number of people lynched has dropped to near zero. The next graph shows how lynching has decreased over time for whites and blacks. The decrease in black lynchings lags about 30 years behind the decrease in white lynchings.

In the first few years of the above chart, whites outnumber blacks in the number of lynchings. Below is a graph showing the percentage breakdown by year. Whites outnumber blacks for 4 years and then blacks outnumber whites for 60 years. In the last few years in the data set, whites exceed blacks again. Because lynchings were so infrequent in these last few years, the percentages swing widely: in a year with only three lynchings, two white victims account for 66.6% of the total.
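To see why the percentages become jumpy when the counts are small, here is a short Python sketch. The yearly counts are invented for illustration and are not taken from the Tuskegee data, and the function name is my own.

```python
def white_share(white, black):
    """Percentage of victims who were white, or None if there were no victims."""
    total = white + black
    if total == 0:
        return None
    return 100 * white / total

# Invented counts: (white victims, black victims).
yearly = {"busy year": (30, 70), "quiet year": (2, 1)}
shares = {label: white_share(w, b) for label, (w, b) in yearly.items()}
# busy year: 30.0 percent; quiet year: about 66.7 percent
```

In the busy year a single victim barely moves the percentage, while in the quiet year a single victim changes it by a third.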

I hope you learned something new from reading this analysis and, as always, please feel free to add ideas for additional graphs or analysis in the comment section.

Data Link and Notes:

[1] - http://law2.umkc.edu/faculty/projects/ftrials/shipp/lynchingsstate.html
[2] - http://law2.umkc.edu/Faculty/projects/ftrials/shipp/lynchingyear.html

Friday, October 14, 2016

Launching @BestOfData

[Image Credit: Gizmodo]

Since I started HallwayMathlete, many of my friends have taken an interest in data visualization and data journalism in general. They have asked for suggestions on other websites to get top-quality stories told through data analysis. The easiest answer is FiveThirtyEight. However, there are many smaller blogs that turn out fantastic quality material, but in low volume. These smaller websites tend not to have Twitter accounts, Facebook Pages, or other venues for fans to follow their most recent work.

For this reason, I am launching @BestOfData. Best of Data is an aggregation of the best smaller data journalism blogs on the web. Currently, Best of Data automatically pulls from a list of my favorite blogs, and I will manually add other articles I find interesting.

The complete list of websites automatically populating @BestOfData:
FlowingData.com
HowMuch.net
RandalOlson.com/Blog
Priceonomics.com
InformationIsBeautiful.net
ToddwSchneider.com
Insidesamegrain.com

Any websites you think I should add to this list? Please add them in the comments.

If you are not a Twitter user (you should really have a Twitter), the list of articles can be found at BestOfData.org.

Sunday, August 7, 2016

Which is the best Pokemon in Pokemon GO?

Below are interactive graphics that display the average max CP of each Pokemon by trainer level. The first graph shows only the top Pokemon in Pokemon GO. The second graph shows all Pokemon, and the remaining graphs are grouped by type.

Tips:

If you want to know which Pokemon a line represents, place your mouse on the line and the information will be displayed.
To remove lines from the plot click the Pokemon's name in the legend.
To see a graph in greater detail, click and drag over the region you wish to observe.

Note: The graphics look better on desktop.

Top Pokemon
Best Pokemon:
1. Mewtwo

2. Dragonite

3. Mew

4. Moltres

5. Zapdos

6. Snorlax

7. Arcanine

8. Lapras

9. Articuno

10. Exeggutor

11. Vaporeon

12. Gyarados

13. Flareon

14. Muk

15. Charizard

Top Pokemon Currently Available In-Game

1. Dragonite

2. Snorlax

3. Arcanine

4. Lapras

5. Exeggutor

6. Vaporeon

7. Gyarados

8. Flareon

9. Muk

10. Charizard

Some interesting notes: Blastoise and Venusaur are both missing from the top Pokemon. This is surprising for many who played the Game Boy games, because the starter Pokemon were always some of the most powerful. Also, Dragonite and Charizard are the only top 10 Pokemon that have three evolution forms. All the others have two, except Snorlax and Lapras, which do not evolve.

After trainer level 30, the CP limit for each Pokemon grows at a slower rate than at previous levels.

All Pokemon

Magikarp is all the way at the bottom and by far the worst Pokemon in Pokemon GO.

By Type

The data for the above plots comes from Serebii.net. In the following weeks, Hallway Mathlete will post a web-scraping tutorial showing how the data was gathered.

Sunday, July 3, 2016

Analyzing the Annual Republicans vs. Democrats Congressional Baseball Game

Every year, the United States Congress takes a break from blocking each other's bills and plays a charity baseball game. The best part is that the teams are split along party lines: Republicans vs. Democrats. The tradition was started in 1909 by Representative John Tener of Pennsylvania, a former professional baseball player. Last week was the annual game, and Republicans were able to break a seven-year winning streak by the Democrats.

The graph below shows the net wins over the series. The higher on the y-axis, the more Republican wins; the lower on the y-axis, the more Democratic wins. From this graph it is fairly obvious that each party has had long winning streaks. The gray dots represent years when the game was not held or I could not find any information about the game. In 1935, 1937, 1938, 1939, and 1941, games were held between members of Congress and the press.
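The net-wins line is essentially a running total, and a few lines of Python can sketch the idea. The results list is invented, the `"R"`/`"D"` labels and function name are my own, and treating ties and missing years as a flat step is an assumption about how such a chart could be built rather than a description of the original code.

```python
from itertools import accumulate

def net_wins(winners):
    """Running count of Republican wins minus Democratic wins."""
    steps = {"R": 1, "D": -1}
    # Anything else (a tie, or None for a year without a game) adds 0.
    return list(accumulate(steps.get(w, 0) for w in winners))

# Invented sequence of results, one entry per year.
results = ["D", "D", "R", None, "R", "R", "T"]
series = net_wins(results)
# series is [-1, -2, -1, -1, 0, 1, 1]
```

A long winning streak shows up as a steady climb or fall in this series, which is what makes the streaks easy to spot on the graph.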

The following graph displays the points scored by each team over time. In the early years of the series, the games had much higher total scores than more recent years.

Next, the point differentials were explored. The point differential is the difference between the scores of the two teams. Many of the closest games were held from the late seventies through the nineties. This time period also saw few winning streaks because the competition was fairly even between the parties.

A histogram was made to understand the distribution of point differences. The Democrats have some extremely large wins, with three wins by more than 20 points, while the Republicans have none. Another interesting finding is that only one game ended in a tie. This is surprising because the charity event does not have overtime, so one would expect more than one of the 81 games played to end in a tie.
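As a rough illustration of the differential and histogram steps, here is a small Python sketch. The scores are made up, the sign convention (Republican score minus Democratic score) and the bin width of 5 are assumptions, and the function names are mine.

```python
from collections import Counter

def differentials(scores):
    """Republican score minus Democratic score for each game."""
    return [rep - dem for rep, dem in scores]

def histogram(values, bin_width=5):
    """Count values per bin; each bin is labeled by its lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)

# Made-up (Republican, Democratic) final scores.
games = [(7, 5), (3, 3), (2, 24), (10, 9), (4, 12)]
diffs = differentials(games)   # [2, 0, -22, 1, -8]
bins = histogram(diffs)        # bins 0, -10 and -25 hold 3, 1 and 1 games
ties = diffs.count(0)          # 1
```

Negative bins are Democratic wins under this convention, so a blowout like the made-up 24 to 2 game lands far out on the left of the histogram.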

Over the years, the annual game has been held at many different locations. Each party has had different rates of success at each field. The winning percentage at each field was calculated to understand whether either party has a home-field advantage at any park. Langley High School is a bit of an outlier because it was selected as the location after two rain delays and only hosted one game. American League Park II and Georgetown Field were the first two stadiums to host the game, and each only hosted one game. Memorial Stadium had the fourth fewest games with only four, but all other locations had nine or more games.

Ironically, RFK Stadium, named after the famous Democratic U.S. Senator, has given Republicans a strong home-field advantage. Republicans have won 13 out of the 14 games played at the stadium. Democrats have seen similar success at Nationals Park, winning 7 out of the 9 games.
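The per-field winning percentages can be sketched in Python as follows. The stadium names, the alias table, and the game list are all hypothetical placeholders. The merging step mirrors the note at the end of this post about combining renamed stadiums under their most recent name, but the code itself is only an illustration.

```python
from collections import defaultdict

# Hypothetical alias table: old stadium name -> most recent name.
ALIASES = {"Old Park Name": "New Park Name"}

def win_rates(games, aliases=ALIASES):
    """Return {stadium: {party: win percentage}} after merging renamed stadiums."""
    wins = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for stadium, winner in games:
        name = aliases.get(stadium, stadium)  # fold old names into new ones
        totals[name] += 1
        wins[name][winner] += 1
    return {
        name: {party: 100 * n / totals[name] for party, n in wins[name].items()}
        for name in totals
    }

# Made-up (stadium, winning party) records.
games = [
    ("Old Park Name", "R"),
    ("New Park Name", "R"),
    ("New Park Name", "D"),
    ("Other Field", "D"),
]
rates = win_rates(games)
# "New Park Name": R about 66.7 percent, D about 33.3 percent
```

The same arithmetic applied to the real counts gives Republicans about 92.9% at RFK Stadium (13 of 14) and Democrats about 77.8% at Nationals Park (7 of 9).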

Currently, I am planning to update these graphs each year after the annual game. Please feel free to add ideas for additional graphs or analysis in the comment section.

Notes:

The data came from https://en.wikipedia.org/wiki/Congressional_Baseball_Game#Game_results
Some of the stadiums were renamed over the years, and the original data set contained both names. For the analysis, entries for the same stadium were combined under the most recent name.