Uncovering People's Dirty Little Secrets on Venmo with a Python Web Scraper
Of course, Venmo is perfect for splitting the bill from that crazy night out with your friends. Yet it is easy to forget the pitfall of this convenience at your fingertips: your privacy may be compromised.
In fact, all payments you make through Venmo are publicly accessible unless you specifically make them private. This public data can offer some interesting insights, such as what people (supposedly) buy and how much they spend. While I was searching for a topic for my final project in an NYU class called Python for App this spring, a more interesting, perhaps mischievous, question came to mind: “What kind of bad things do people buy through Venmo?”
With some research, I was able to find relevant data on a website called Vicemo.com. Built by two developers, Mike Lacher and Chris Baker, Vicemo.com uses the Venmo API and JavaScript to collect real-time, public Venmo transactions involving drugs, booze, and sex.
Voila!
Using a web scraping tool allows us to quickly collect the data without intensive coding against the Venmo API.
Introduction
For this project, I use Python, a powerful tool for web scraping. However, after visiting Vicemo.com, I quickly realized that its content is delivered entirely through JavaScript. It can be challenging to get data from a website that depends on JavaScript to dynamically render its content: Python modules such as Requests and BeautifulSoup are not very fruitful here, because the content on such websites loads only when user activity in the browser triggers the corresponding JavaScript. While struggling with this problem, I found a potential solution in a blog post by Todd Hayton, a freelance software developer: use Selenium, a tool for automating browsers, together with PhantomJS, a non-GUI (or headless) browser.
Implementation
First, we need to install PhantomJS and the Selenium bindings for Python. I am quoting the code snippet from Hayton’s blog post.
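The installation goes roughly like this (a sketch, not necessarily Hayton’s exact commands; it assumes pip and npm are available on your machine):

```shell
# Install the Selenium bindings for Python
pip install selenium

# Install the PhantomJS binary globally via npm (one of several ways to get it)
npm install -g phantomjs
```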
Or, you can download PhantomJS and Selenium manually and place them in your virtual environment’s library. Here are the links:
- PhantomJS: http://phantomjs.org/download.html
- Selenium: http://www.seleniumhq.org/download/
Now, let’s see how this is done.
There are three tasks that need to be done in the following order:
- Scrape web elements
- Parse HTML elements
- Cleanse data
We create three classes to do these tasks: LetMeScrapeThat, LetMeParseThat, and LetMeAnalyzeThat. An additional class, VicemoScraper, instantiates these three. Below is a code snippet that showcases how PhantomJS and Selenium are used.
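Since the original snippet is not reproduced here, the following is a minimal sketch of what it might look like. The class, method, and attribute names come from the article; the wait times, scroll count, and the "transaction" element class are my assumptions. Note also that PhantomJS support has since been removed from newer Selenium releases, so a modern rewrite would use headless Chrome or Firefox instead:

```python
from time import sleep


class LetMeScrapeThat:
    """Sketch of the scraping class; implementation details are assumptions."""

    def __init__(self):
        # Imported here so the class can be inspected without Selenium installed
        from selenium import webdriver
        self.phantom_webpage = webdriver.PhantomJS()  # headless browser
        self.transactions = []

    def scrape_vicemo(self, link, num_scrolls=10):
        # Load the target page and give JavaScript time to render content
        self.phantom_webpage.get(link)
        sleep(5)
        for _ in range(num_scrolls):
            # Scroll to the bottom so more transactions render dynamically
            self.phantom_webpage.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);"
            )
            sleep(2)
        # Collect the rendered elements and keep their raw HTML
        elements = self.phantom_webpage.find_elements_by_class_name("transaction")
        self.transactions = [el.get_attribute("innerHTML") for el in elements]
```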
First, the class LetMeScrapeThat instantiates a PhantomJS headless browser by calling webdriver.PhantomJS(). We then access the target website via the method self.phantom_webpage.get(link). Note how the sleep() method from the time module is called to let JavaScript render the desired data completely. In addition, the for loop in the method scrape_vicemo() scrolls down the page in our PhantomJS browser to access more data, which is rendered dynamically as the user scrolls. Finally, we extract the HTML from the web elements and store it in the list variable self.transactions.
Let’s now take a look at LetMeParseThat, which parses the HTML to extract the data. Here is an example of how the data might look after being rendered by JavaScript. Note that the description of each transaction is within a <div> tag with class="description". Also note how the emojis are represented by the title=emoji-name attribute of the <span> tags.
Take a look at the code snippet for the LetMeParseThat class below. As observed, the transaction descriptions contain both strings and emojis, so we use the extract_string_data() and extract_emoji_data() methods to extract the strings and emojis from the HTML accordingly.
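The original snippet is not reproduced here, so below is a stdlib-only sketch of the parsing idea. The sample HTML string and the DescriptionParser name are illustrative: the markup structure (a div with class="description", emojis as span title attributes) is inferred from the article, and a real implementation might use BeautifulSoup with the article's extract_string_data()/extract_emoji_data() methods instead:

```python
import re
from html.parser import HTMLParser

# Hypothetical sample of one transaction's rendered HTML
SAMPLE = '<div class="description">pizza and <span title="beer mug"></span></div>'


class DescriptionParser(HTMLParser):
    """Pulls plain-text words and emoji names out of a transaction description."""

    def __init__(self):
        super().__init__()
        self.in_description = False
        self.strings = []  # plain-text words from the description
        self.emojis = []   # emoji names taken from span title attributes

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "description":
            self.in_description = True
        elif tag == "span" and self.in_description and "title" in attrs:
            self.emojis.append(attrs["title"])

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_description = False

    def handle_data(self, data):
        if self.in_description:
            # Keep only word-like tokens
            self.strings.extend(re.findall(r"[A-Za-z']+", data))


parser = DescriptionParser()
parser.feed(SAMPLE)
print(parser.strings)  # ['pizza', 'and']
print(parser.emojis)   # ['beer mug']
```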
Moreover, these data need to be cleansed before we can use them. The class LetMeAnalyzeThat does this job.
Note that the most commonly used English words are hard-coded to help the program filter out trivial words; remember, we are only interested in what Venmo users are paying for. Using regex, we strip special characters and whitespace from the words. The cleansed words are then tallied in a counter, and a similar process is carried out for the emoji data. The resulting outputs are dictionaries mapping the objects or activities Venmo users paid for to how many times each appears in the collected data. Finally, we have all the tools to obtain the data.
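A minimal sketch of that cleansing step, assuming a hard-coded stopword list and a Counter-backed tally (the stopword set and the cleanse() function name are hypothetical; the article wraps this logic in the LetMeAnalyzeThat class):

```python
import re
from collections import Counter

# Hypothetical stopword list; the article hard-codes common English words
# to filter out trivial terms
STOPWORDS = {"the", "a", "an", "and", "for", "to", "of", "with", "my", "your"}


def cleanse(words):
    """Strip special characters and whitespace, drop stopwords, then tally."""
    counts = Counter()
    for word in words:
        cleaned = re.sub(r"[^a-z]", "", word.lower())  # keep letters only
        if cleaned and cleaned not in STOPWORDS:
            counts[cleaned] += 1
    return dict(counts)


print(cleanse(["Beer!", "for", "the", "beer", "  pizza "]))
# {'beer': 2, 'pizza': 1}
```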
Conclusion
The working version of the code from this project is available on GitHub. Here is the link. There are many different ways to use this data. Below is an example visualization that I created using Tableau Public with data obtained on July 27th.