Sputnikmusic - Statnik 1

Statnik 1

By macman76 Wednesday July 19, 2017

Hello world, and welcome to a first-of-its-kind staff blog, one written by someone with no reviews and pedestrian/almost non-existent music taste. I joined the site when I was trying to find something to fall deeply into, and I thought being the only person I knew that liked Led Zeppelin meant that I could become a SERIOUS music listener. Of course, I failed and, besides a real weak stream of bands I like, I don’t listen to much music. As a result, I stopped regularly visiting the site after maybe 6 months of being a regular commenting member. Nevertheless, I returned to the site because I found a different interest.

Now, I’m not an expert in this kind of stuff. I didn’t actually get a degree in the kind of thing that would make one said expert. More than anything I’m a diligent and creative googler, which is the level at which you can fake expertise. I like data. Since I decided I liked data, I have done SERIOUS data guy stuff. I started to keep track of my stats in video games like COD: BO and Rocket League, and I analyzed this here website. (That’s it. There’s not really a third thing. I tried making a tool to help me with a fantasy football draft once, but people were drafting so fast it was probably more costly than useful.)

So, as best I can tell, these blogs will go something like this: I’ll write some kind of description of some cool thing I’ve done/am doing with the data on this site, maybe there will be a story of some sort, and then I’ll present code for how to do it. You’ll cheer, you’ll cry, and you’ll learn stuff. I don’t know how regularly I’ll make you cheer/cry/learn, but it will not be never!

The first thing I had to learn, before I could make any of the lists I have done, was how to grab data (sometimes referred to as data scraping or munging or back-alley mugging) from the website. I initially tried copying the ratings from soundoff pages in Google Chrome (Right Click > Inspect) and pasting the table objects into Excel. But that was tedious and unscalable, so then I found out how to do it with my current go-to language, R.

(It’s free! To install R, go to this website https://cran.r-project.org/mirrors.html, download from any mirror you like, and then download this handy IDE https://www.rstudio.com/ to make working with R a breeze rather than a blast-from-the-past-trembling-fear-inducing chore caused by the standard RGUI.)

R is a statistics-focused language. Packages and online code examples are often written for and by college math/stats/information sciences department people. It’s relatively simple to use and follow, and there are a lot of free books, MOOCs, and blog posts on how to use R. But what makes R especially good is that it’s free. It doesn’t cost thousands of dollars like other comparable languages do (Matlab… SAS… Stata). End paid advertisement.

R has a package that lets you load webpage html code as text-like objects. (Some websites don’t let you read their html code with R and presumably with any other language, and I don’t know why or how, but it happens.) It has functions to let you interact with html code, such as finding specific html tags, external html links found on pages, and (a third thing!), importantly, html tables. The rating data, as well as a lot of things on sputnikmusic, are stored in html tables. If you install the packages dplyr (a very cool data manipulation package) and XML (the package to read web html) with the following code:
install.packages(c("dplyr","XML"))
and then run the following code in the R console,
library(dplyr)
library(XML)

# Scrape the ratings table from a sputnikmusic soundoff page.
# obj is either the soundoff URL (a character string) or an already-parsed html document.
scrape_soundoff <- function(obj, link = NA) {
  doc_classes <- c("HTMLInternalDocument", "XMLInternalDocument", "XMLAbstractDocument")
  is_doc <- any(class(obj) %in% doc_classes)

  if (!is_doc && is.character(obj)) {
    link <- obj
    if (!grepl('/soundoff.php', obj, fixed = TRUE)) stop('Character string provided is not a soundoff page')
    obj <- htmlParse(obj)
  } else if (!is_doc) {
    stop(paste0('Input is not an html object or a character. ',
                'If calling this function directly, ',
                'ensure that your input is a character ',
                'string that is the name of a sputnikmusic soundoff page.'))
  }

  links <- getHTMLLinks(obj)
  user_links <- grep('/user/', links, value = TRUE)

  best_links <- grep('/best/albums/', links, value = TRUE)
  if (length(best_links) > 1) {
    # if there's more than one link to best albums,
    # that means the release year is at the end of the last one
    release.year <- as.numeric(tail(unlist(strsplit(tail(best_links, 1), '/')), 1))
  } else {
    # otherwise pull the year out of the second <b> element on the page
    release.year <- as.numeric(tail(unlist(strsplit(unlist(lapply(xpathSApply(obj, "//b"), xmlToList)[[2]]), '/')), 1))
  }

  dat <- readHTMLTable(obj, which = 1)

  # keep only the relative /user/ links
  user_links <- user_links[!grepl('http:/', user_links)]

  dat$V2 <- as.character(dat$V2)
  names(dat)[1] <- 'Rating'
  dat <- dat[!is.na(dat$V2), ]
  dat <- dat[c(1, 2)]
  dat$Rating <- as.numeric(substr(dat$Rating, 1, 3))
  dat <- dat[!is.na(dat$Rating), ]
  dat <- dat[2:nrow(dat), ]  # skip the first remaining row

  # each V2 entry looks like "user | date"
  split_rating <- function(entry) {
    tmpstr <- strsplit(x = entry, split = ' | ', fixed = TRUE)
    data.frame(user = tmpstr[[1]][1], date = tmpstr[[1]][2], stringsAsFactors = FALSE)
  }
  dat <- data.frame(dat, bind_rows(lapply(dat$V2, split_rating)), stringsAsFactors = FALSE)
  user_links <- user_links[dat$Rating > 0]
  dat <- dat[dat$Rating > 0, ]
  dat <- dat[, c(1, 3, 4)]

  # turn strings of the form "<month name> <day>th <two-digit year>" into Dates
  sputdate <- function(dates) {
    if (!is.character(dates)) stop('Not readable')
    s <- rep(NA_character_, length(dates))
    for (i in seq_along(dates)) {
      # The split string was broken across lines in this post;
      # splitting on any run of whitespace is assumed to be equivalent.
      tmpdate <- unlist(strsplit(trimws(dates[i]), '[[:space:]]+'))
      if (length(tmpdate) > 1) {
        mo <- grep(tmpdate[1], month.name)
        da <- as.numeric(substr(tmpdate[2], 1, nchar(tmpdate[2]) - 2))
        ye <- paste0('20', tmpdate[3])
        s[i] <- paste(ye, mo, da, sep = "/")
      }
    }
    as.Date(s, "%Y/%m/%d")
  }
  dat$date <- sputdate(dat$date)

  trim.trailing <- function(x) sub("\\s+$", "", x)
  dat$user <- trim.trailing(dat$user)
  dat$userlinks <- substr(user_links, 7, nchar(user_links))  # drop the leading "/user/"
  dat$albumlink <- link
  dat <- data.frame(release.year, dat)
  dat <- dat[order(dat$date, decreasing = TRUE), ]
  return(dat)
}

sputurl <- "http://www.sputnikmusic.com/soundoff.php?albumid=14363"
dat <- scrape_soundoff(sputurl)
print(head(dat))

… you will have read the soundoff page for the critically un-thought-of (seriously, its wiki page is empty) My Fruit Psychobells… A Seed Combustible album by the sputcore band “maudlin of the Well”. Your object, which is named “dat”, short for “data”, will be a table-like object containing the release year for the album, every rating on the soundoff page, the listed name of every user (the one you can edit), as well as their official name (the one that has to be unique and appears in the url of someone’s profile page), the date of every rating, and the soundoff page link. You can print its contents by typing “dat”, sans quotation marks, into the R console and hitting enter. You can replace the link in the “sputurl” string with whatever soundoff page you want, and you can get the ratings data for that album instead (i.e. sputurl <- "http://www.sputnikmusic.com/soundoff.php?albumid=xxxxx").
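Since swapping the albumid is all it takes to scrape another album, the same idea can be looped over several albums at once. Below is a minimal sketch of that, not part of the original code: soundoff_url and scrape_many are hypothetical helper names, the one-second pause is an arbitrary courtesy delay, and the scraper argument defaults to the scrape_soundoff function defined above.

```r
# Hypothetical helpers (not from the original post): build soundoff URLs
# from album ids and stack the scraped tables into one data frame.
soundoff_url <- function(album_id) {
  # as.integer keeps large ids from being printed in scientific notation
  paste0("http://www.sputnikmusic.com/soundoff.php?albumid=", as.integer(album_id))
}

scrape_many <- function(album_ids, scraper = scrape_soundoff, pause = 1) {
  pages <- lapply(album_ids, function(id) {
    Sys.sleep(pause)  # wait a moment between requests
    # if one page fails, warn and skip it instead of stopping the whole run
    tryCatch(
      scraper(soundoff_url(id)),
      error = function(e) {
        warning("Skipping album ", id, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  pages <- Filter(Negate(is.null), pages)  # drop the pages that failed
  do.call(rbind, pages)
}

# Needs scrape_soundoff from above and an internet connection:
# album_ids <- c(14363)  # add more albumid values here
# dat_many <- scrape_many(album_ids)
```

Every row keeps its own albumlink value, so the stacked table can still be split back out by album afterwards.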

You can then do things like make a histogram of all the ratings, like the one found on a review page,
hist(dat$Rating,xlab = 'Rating',ylab='Count')
or you can plot each rating as a time series,
with(dat,plot(date,Rating,'l',main='Timeseries'))
or a time series with a smoothed trend line.
with(dat,plot(date,Rating,'l',main='Smoothed Timeseries'))
lines(dat$date[!is.na(dat$date)], loess(Rating~as.numeric(date),dat,span = .5)$fitted,'l',col = 'red')
And that’s just the start of what you can do when you have a big imagination… and google.
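One more small thing to try: hist() picks its own bins, so counting the ratings step by step can give a cleaner picture. This is only a sketch under two assumptions that are mine, not the post's: that ratings come in half-point steps from 1 to 5, and that the made-up "toy" table below stands in for the real "dat".

```r
# Toy stand-in for the scraped table; these ratings are made up for illustration.
toy <- data.frame(Rating = c(5, 4.5, 4.5, 4, 3.5, 5, 2, 4.5))

# Assumption: ratings run from 1 to 5 in half-point steps.
steps <- seq(1, 5, by = 0.5)

# Count how many ratings fall on each step, keeping the empty steps too
counts <- table(factor(toy$Rating, levels = steps))

barplot(counts, xlab = 'Rating', ylab = 'Count', main = 'Ratings by half-point step')

# Average rating across the table
avg <- mean(toy$Rating)
print(avg)
```

With a real scrape, replacing toy with dat should give the same kind of plot for the album's actual ratings.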

P.S. Github link is here for a version of this code that will also add band, album, and genre tag information to the “dat” table.

29 Comments

macman76
07.19.17

Hi!
The code looks atrocious, I'll have to figure out how to present it cleanly.

AlexKzillion
07.19.17

Hey!

theBoneyKing
07.19.17

FINALLY

SandwichBubble
07.19.17

Ay it's here :D

brainmelter
07.19.17

v nice

henryChinaski
07.19.17

Wait...what?

Conmaniac
07.19.17

ayy macMAN what up?

macman76
07.19.17

Hey con

Conmaniac
07.19.17

been some time huh? glad ya got this up and running tho

macman76
07.19.17

Yeah man, the hold up was I didn't have a blog account until a couple of weeks ago, I should be doing these somewhat regularly

klap
07.19.17

beautiful

Storm In A Teacup
07.19.17

ooooh!!!! looks like i came back at the right time

Dewinged
07.20.17

Jesus macman, what is this sorcery?

RogueNine
07.20.17

There you are.

Winesburgohio
07.20.17

WE FIVETHIRTYEIGHT NOW

honestly though i don't think i've been this entranced by an article talking about data ever, great write-up!!

Trebor.
07.20.17

She be offering the thrussy but what I really want is STATS

macman76
07.20.17

Thanks wine, I hope I can be as entertaining with numbers as this https://youtu.be/JDZhPod87sw

ArsMoriendi
07.20.17

Shouldn't bluegrass be under country? And maybe Americana?

macman76
07.20.17

Reggae probably shouldn't be it's own super genre lol, it's the definition I used because I had to do something

Tyler.
07.20.17

can someone do this for me i am too lazy

TVC15
07.20.17

OH SHIT

tempest--
07.20.17

should have been called Sputistics smh

theacademy
07.20.17

someone with no reviews and pedestrian/almost non-existent music taste

hmmmmm

DaveyBoy
07.22.17

My brain just exploded... How do I get this R thingamajig on my Nokia 6610?

macman76
07.22.17

Short answer is you can't use it on mobile, needs to be on a computer

Spacesh1p
07.25.17

Nice intro to what you've been up to. I've been learning Python and have considered using Sput to practice some web scraping too. If I ever get around to it I'll let you know.

macman76
07.25.17

I know a little python, it's my understanding that beautifulsoup is a pretty comprehensive web scraping library I just haven't gotten around to learning it

macman76
07.25.17

and, yeah, I wouldn't mind having a guest-aided post

Spacesh1p
07.25.17

BS is quite good yeah. I use it at work for parsing article text from web scraping certain news sites that don't have APIs. I need to practice more so if you have any ideas of some analysis that would be interesting feel free to shoutbox me.
