SideStories XIII: Scraping XML
Okay, so here’s the thing: I’m a curious bloke when it comes to learning new things in coding, and yesterday I found myself learning about XML (short for Extensible Markup Language). According to Wikipedia:
Extensible Markup Language is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
I was having a skim through the directories on my Mac and found a curious “Facebook” folder which contained an XML file. I opened the file in Atom and found readable data, including my UUID (Universally Unique Identifier), the email I use for Facebook, and so forth. Curious, I thought: a file which stores the specific data associated with my Facebook account, sitting right here on my machine.
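To give a rough idea of what "human-readable and machine-readable" means in practice, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`. The XML in it is entirely made up: the element names (`account`, `uuid`, `email`), the values, and the `read_account` helper are my own placeholders, not what the real Facebook file contains.

```python
import xml.etree.ElementTree as ET

# Made-up XML for illustration only; this is NOT the real file's layout.
SAMPLE_XML = """<?xml version="1.0"?>
<account>
    <uuid>00000000-0000-0000-0000-000000000000</uuid>
    <email>someone@example.com</email>
</account>
"""


def read_account(xml_text):
    """Parse the XML text and return each child element as a dict entry."""
    root = ET.fromstring(xml_text)
    # Map every direct child tag (e.g. "email") to the text inside it.
    return {child.tag: child.text for child in root}


account = read_account(SAMPLE_XML)
print(account["email"])
```

A person can read the tags by eye, and a few lines of code can pull the same values out, which is presumably the whole appeal of the format.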
To me, this looked like a file meant for storing and transferring data, maybe a copy left in my files as a gift to me, though it was neatly tucked away in my Library.
I slept on it, then thought a bit more about how this file could have been created. I wanted to learn how to write data in this form for storage and recall it later, or how to parse a website, pull specific data out of it, and ping that data into an XML document. I had a little Google around and found the term “scraping”, which means going through a page and pulling out specific data, for spreadsheets and the like. This could be useful for understanding customer behaviour better, or for tracking the history of a particular thing over time.
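The idea of pulling specific data out of a page and pinging it into an XML document could be sketched roughly like this. It uses only Python's standard library (`html.parser` and `xml.etree.ElementTree`) rather than any particular scraping library, and it works on a made-up snippet of HTML instead of a live website. The page content, the `item` class name, and the `ItemScraper` and `scrape_to_xml` names are all assumptions of mine for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Made-up page content; a real scraper would download this from a site.
SAMPLE_HTML = """
<html><body>
  <h1>Price list</h1>
  <ul>
    <li class="item">Apples</li>
    <li class="item">Pears</li>
    <li class="other">Not wanted</li>
  </ul>
</body></html>
"""


class ItemScraper(HTMLParser):
    """Collects the text of every <li class="item"> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        # Only start collecting inside list items marked with class="item".
        if tag == "li" and dict(attrs).get("class") == "item":
            self._in_item = True
            self.items.append("")

    def handle_data(self, data):
        if self._in_item:
            self.items[-1] += data

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False


def scrape_to_xml(html_text):
    """Scrape the wanted items from the HTML and return them as an XML string."""
    scraper = ItemScraper()
    scraper.feed(html_text)
    scraper.close()
    root = ET.Element("items")
    for text in scraper.items:
        ET.SubElement(root, "item").text = text.strip()
    return ET.tostring(root, encoding="unicode")


xml_output = scrape_to_xml(SAMPLE_HTML)
print(xml_output)
```

Only the two list items with the `item` class end up in the XML; everything else on the page is ignored, which is the "specific data" part of scraping.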
I found an excellent breakdown of XML scraping by Shirish Gupta here, which goes about the process using a Python library called “Beautiful Soup”. I wanted to give it a go myself, so I copied the code into Atom (purely for educational purposes!) and tried to run it in Terminal with the command “python (file-name).py”, but the file wouldn’t run. Long story short, I am currently downloading Python onto my Mac to try to understand and implement this technique. It’s taking a while.
Let’s see how that goes! In the meantime, I’ve also made a simple XML document as a first attempt, which you can find here on my GitHub.