Inspired by some graphs dr_tectonic produced, which showed graphical representations of individual songs, I got distracted this morning and started thinking about who songs are written about.
As it happens, I have a (gendered) list of names on my computer from previous research, and I also have a dump of OLGA (the OnLine Guitar Archive, now defunct due to pressure from the MPA and NMPA) for my own personal, educational and research use. That's about 10,000 songs from about 1,100 artists; it's worth noting the list is biased towards English-language popular music with guitars in it that can be represented with chords or tab (i.e. rock/blues/pop) from the last fifty or so years.
There are a few things that make this problem difficult. First, identifying names is hard. I've assumed that they're capitalized words in songs that appear on the name list. That does cause some problems: there are a lot of names that are common words ('Will', 'Hope', 'Van', etc.). Second, there's no XML here: it's all just flat text files, in directories named by the first eight characters of the band name. Third, identifying the gender of names is a whole problem unto itself. Fourth, I'm assuming that anonymous transcribers of songs are scrupulous about capitalization -- no "layla! i get down on my knees". Fifth, I don't want this to be more biased than necessary by the names of the artists: I want to know who they're writing about, not who's doing the writing. And sixth, I'm counting each name each time it shows up, not once per song, so the name 'Layla' gets nine hits, despite the fact that it's probably one song. Makes you wonder why I even bother trying. So the code has a lot of hedging to try and get around those problems[1], *and* I have to manually go into the result and decide what I think are and are not names.
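As a rough illustration of the matching rule described above (this is a sketch, not the original code), a word counts as a name only if it is capitalized in the transcription and appears on the name list. The tiny `NAMES` dictionary, the tokenizing regex, and the function names are all assumptions made up for the example. It tallies both ways, per mention and once per song, to show how far apart the two counts can drift.

```python
import re
from collections import Counter

# Hypothetical stand-in for the gendered name list (name -> gender).
NAMES = {"layla": "f", "john": "m", "mary": "f", "will": "m"}

# Assumption: a "word" is simply a run of ASCII letters.
WORD = re.compile(r"[A-Za-z]+")

def names_in_line(line):
    """Return lowercased words that are capitalized and on the name list."""
    return [w.lower() for w in WORD.findall(line)
            if w[0].isupper() and w.lower() in NAMES]

def count_names(songs):
    """songs is a list of songs, each a list of lyric lines."""
    per_mention = Counter()  # every occurrence counts
    per_song = Counter()     # at most one count per song
    for lines in songs:
        seen = set()
        for line in lines:
            for name in names_in_line(line):
                per_mention[name] += 1
                seen.add(name)
        per_song.update(seen)
    return per_mention, per_song

songs = [
    ["Layla, you've got me on my knees",
     "Layla, I'm begging darling please",
     "layla! i get down on my knees"],   # lowercase, so it is missed
    ["I will follow John",
     "Will you, Mary?"],                 # sentence-initial 'Will' is a false hit
]
per_mention, per_song = count_names(songs)
```

In this toy input 'layla' scores 2 per mention but 1 per song, the lowercase line is missed entirely, and the sentence-initial 'Will' is counted even though it is not a name.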
But, in conclusion, out of 1,255,417 lines in 10,296 songs, the following names are mentioned more than 50 times:
will 464 #likely not a name most of the time
jesus 199
john 163
joe 144
america 147 #weird name list
dan 108
johnny 108
billy 108
mary 102 #first female name
paul 78
van 72 #mainly due to a lot of non-english songs, not mr. morrison
james 69
peter 69
jack 66
tom 66
sally 63
jimmy 62
santa 59
ray 59
polly 55
willie 51
Aren't you glad you didn't ask?
J
[1]: I skip the first five lines of the file, I skip any lines with 'by' or 'artist' in them, I skip lines which are All In Title Case, I skip lines that have the name of the directory in them [i.e. some approximation of the artist], and I skip the words ["you", "love", "come", "song", "into", "set", "straight", "christmas", "lady", "round", "york", "melody", "young"], which are in the names list but seem unlikely in this context to be informative.
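The skipping rules in the footnote could be sketched roughly as follows. This is a guess at their shape, not the original code: the function names are invented, and it assumes 'by' and 'artist' are matched as whole words (a plain substring test would also throw away every line containing "baby") and that the directory name is compared case-insensitively.

```python
import re

# Words that are on the name list but are skipped, as listed in the footnote.
SKIP_WORDS = {"you", "love", "come", "song", "into", "set", "straight",
              "christmas", "lady", "round", "york", "melody", "young"}

HEADER_LINES = 5  # the first five lines of each file are skipped

# Assumption: whole-word, case-insensitive match for the credit markers.
CREDIT = re.compile(r"\b(by|artist)\b", re.IGNORECASE)

def is_title_case(line):
    """True if every word that starts with a letter starts with a capital."""
    words = [w for w in line.split() if w[0].isalpha()]
    return bool(words) and all(w[0].isupper() for w in words)

def lyric_lines(lines, directory):
    """Yield the lines of one file that survive the skipping rules."""
    for line in lines[HEADER_LINES:]:
        if CREDIT.search(line):
            continue  # probably a credit line
        if is_title_case(line):
            continue  # probably a song title or heading
        if directory.lower() in line.lower():
            continue  # mentions (an approximation of) the artist
        yield line

def keep_word(word):
    """False for words on the name list that are deliberately ignored."""
    return word.lower() not in SKIP_WORDS

example = ["h1", "h2", "h3", "h4", "h5",
           "Transcribed by someone",
           "Sweet Home Alabama",
           "Oh baby, Sally went down",
           "the beatles rule"]
kept = list(lyric_lines(example, "beatles"))
```

Of the example lines, only "Oh baby, Sally went down" survives. The title-case rule also drops any lyric line that consists of a single capitalized name, which is one more way real mentions can go uncounted.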
For more detail, you might want to check out the complete output of names and frequencies here.
You could, but it's unlikely, want to check out the code here or the [gendered] namelist here.