Visualizing Healthcare Budget using Web Scraping in Python

It almost certainly would have been faster to enter the data manually in Excel, but then I would not have had the invaluable opportunity to practice a few skills! Data science is about solving problems using a diverse set of tools, and web scraping and regular expressions are two areas I need some work on (not to mention that making plots is always fun). The result was a very short but complete project showing how we can bring together these three techniques to solve a data science problem.

1. Fetching data by making an HTTP request
2. Extracting important data by parsing the HTML DOM

Our setup is pretty simple. Create a folder and install Beautiful Soup and requests, along with matplotlib and pandas for the plotting step, by typing the commands given below. I am assuming that you have already installed Python 3.x.

mkdir scraper
pip install beautifulsoup4
pip install requests
pip install matplotlib
pip install pandas

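Now, create a file inside that folder with any name you like. I am using scraping.py. Then import Beautiful Soup and requests, along with pandas and matplotlib, as shown below. The last line fetches the target page through the Scrapingdog API; replace <YOUR_API_KEY> with your own key.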

import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import requests
r = requests.get('https://api.scrapingdog.com/scrape?api_key=<YOUR_API_KEY>&url=https://data.worldbank.org/indicator/SH.XPD.CHEX.PC.CD?most_recent_value_desc=false&dynamic=true').text

This will give you the HTML of the target URL as a string.
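Because the target URL carries its own query string, it can be safer to percent-encode it before handing it to the scraping API. The following is only a sketch: the parameter names api_key, url and dynamic are taken from the request above, the helper name build_request_url and the key "demo-key" are hypothetical, and I am assuming that dynamic=true is meant for the scraping API rather than for the World Bank page.

```python
from urllib.parse import urlencode

# Endpoint and target page as they appear in the request above.
API_ENDPOINT = "https://api.scrapingdog.com/scrape"
TARGET_URL = (
    "https://data.worldbank.org/indicator/SH.XPD.CHEX.PC.CD"
    "?most_recent_value_desc=false"
)


def build_request_url(api_key, target_url=TARGET_URL, dynamic=True):
    """Return the API URL with every parameter percent-encoded.

    Hypothetical helper: it only builds the string, it does not send a request.
    """
    params = {
        "api_key": api_key,
        "url": target_url,  # the inner "?" and "=" get encoded here
        "dynamic": "true" if dynamic else "false",  # assumed to be an API option
    }
    return API_ENDPOINT + "?" + urlencode(params)


request_url = build_request_url("demo-key")  # "demo-key" is a placeholder
print(request_url)
```

The resulting string can be passed to requests.get in place of the hand-written URL, ideally with a timeout argument so that a stalled connection does not hang the script.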

Now, you have to use BeautifulSoup to parse HTML.

soup = BeautifulSoup(r, 'html.parser')
country = list()
expense = list()

I have declared two empty lists to store the country names and the per-day health expenditure of each country.

Each country is stored in a div tag with the class “item”. We’ll collect all of these item tags in a list.

try:
 Countries=soup.find_all("div",{"class":"item"})
except Exception:
 Countries=None
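To see what find_all returns, it can help to try it on a tiny fragment first. The markup below is made up to imitate the layout described above (an item div holding class-less child divs); the real page may be structured differently, and the names and values are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one header item followed by two data items.
sample_html = """
<div class="item"><div>Country</div><div>Year</div><div>Value</div></div>
<div class="item"><div>Country A</div><div>Year A</div><div>1,234</div></div>
<div class="item"><div>Country B</div><div>Year B</div><div>56</div></div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Every div whose class is "item", in document order.
items = sample_soup.find_all("div", {"class": "item"})

rows = []
for item in items[1:]:  # skip the header item
    # Passing None as the attribute value matches divs with no class at all.
    cells = item.find_all("div", {"class": None})
    rows.append((cells[0].text.strip(), cells[2].text.strip()))

print(len(items))
print(rows)
```

The first cell holds the name and the third holds the value, which is why the indices 0 and 2 are used.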

Since the table has 190 rows to read (the countries plus a final “World” aggregate), we run a for loop over each of them. The index i+1 skips the first item, which is presumably the table header.

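for i in range(0,190):
  # class-less divs inside each item: index 0 is the name, index 2 is the value
  cells = Countries[i+1].find_all("div",{"class":None})
  country.append(cells[0].text.replace("\n",""))
  # strip newlines and thousands separators, round to whole dollars, then convert to per-day
  expense.append(round(float(cells[2].text.replace("\n","").replace(',','')))/365)

Data = {'country': country,'expense': expense}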

I have divided the expense by 365 because I want to see how much these countries spend on an everyday basis. Obviously, it would have been easier to take the given data and divide it by 365 by hand, but then there is no point in learning, right? Printing Data gives the following:
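The conversion from a scraped string to a per-day figure can be tried in isolation. This is a small sketch of the same arithmetic; the function name annual_to_daily is hypothetical, and the two inputs are annual values chosen so that the results line up with the first country and the United States in the output.

```python
def annual_to_daily(raw_value):
    """Turn a scraped annual figure (a string) into dollars per day."""
    # Remove newlines and thousands separators before parsing.
    cleaned = raw_value.replace("\n", "").replace(",", "")
    # Round to whole dollars first, as the loop above does, then divide.
    return round(float(cleaned)) / 365


print(annual_to_daily("16"))       # roughly 0.04 dollars per day
print(annual_to_daily("9,870\n"))  # roughly 27 dollars per day
```

Note that the rounding happens before the division, so each daily figure is based on a whole number of dollars per year.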

{'country': ['Central African Republic', 'Burundi', 'Mozambique', 'Congo, Dem. Rep.', 'Gambia, The', 'Niger', 'Madagascar', 'Ethiopia', 'Malawi', 'Mali', 'Eritrea', 'Benin', 'Chad', 'Bangladesh', 'Tanzania', 'Guinea', 'Uganda', 'Haiti', 'Togo', 'Guinea-Bissau', 'Pakistan', 'Burkina Faso', 'Nepal', 'Mauritania', 'Rwanda', 'Senegal', 'Papua New Guinea', 'Lao PDR', 'Tajikistan', 'Zambia', 'Afghanistan', 'Comoros', 'Myanmar', 'India', 'Cameroon', 'Syrian Arab Republic', 'Kenya', 'Ghana', "Cote d'Ivoire", 'Liberia', 'Djibouti', 'Congo, Rep.', 'Yemen, Rep.', 'Kyrgyz Republic', 'Cambodia', 'Nigeria', 'Timor-Leste', 'Lesotho', 'Sierra Leone', 'Bhutan', 'Zimbabwe', 'Angola', 'Sao Tome and Principe', 'Solomon Islands', 'Vanuatu', 'Indonesia', 'Vietnam', 'Philippines', 'Egypt, Arab Rep.', 'Uzbekistan', 'Mongolia', 'Ukraine', 'Sudan', 'Iraq', 'Sri Lanka', 'Cabo Verde', 'Moldova', 'Morocco', 'Fiji', 'Kiribati', 'Nicaragua', 'Guyana', 'Honduras', 'Tonga', 'Bolivia', 'Gabon', 'Eswatini', 'Thailand', 'Jordan', 'Samoa', 'Guatemala', 'St. Vincent and the Grenadines', 'Tunisia', 'Algeria', 'Kazakhstan', 'Azerbaijan', 'Albania', 'Equatorial Guinea', 'El Salvador', 'Jamaica', 'Belize', 'Georgia', 'Libya', 'Peru', 'Belarus', 'Paraguay', 'North Macedonia', 'Colombia', 'Suriname', 'Armenia', 'Malaysia', 'Botswana', 'Micronesia, Fed. Sts.', 'China', 'Namibia', 'Dominican Republic', 'Iran, Islamic Rep.', 'Dominica', 'Turkmenistan', 'South Africa', 'Bosnia and Herzegovina', 'Mexico', 'Turkey', 'Russian Federation', 'Romania', 'St. Lucia', 'Serbia', 'Ecuador', 'Tuvalu', 'Grenada', 'Montenegro', 'Mauritius', 'Seychelles', 'Bulgaria', 'Antigua and Barbuda', 'Brunei Darussalam', 'Oman', 'Lebanon', 'Poland', 'Marshall Islands', 'Latvia', 'Croatia', 'Costa Rica', 'St. 
Kitts and Nevis', 'Hungary', 'Argentina', 'Cuba', 'Lithuania', 'Nauru', 'Brazil', 'Panama', 'Maldives', 'Trinidad and Tobago', 'Kuwait', 'Bahrain', 'Saudi Arabia', 'Barbados', 'Slovak Republic', 'Estonia', 'Chile', 'Czech Republic', 'United Arab Emirates', 'Uruguay', 'Greece', 'Venezuela, RB', 'Cyprus', 'Palau', 'Portugal', 'Qatar', 'Slovenia', 'Bahamas, The', 'Korea, Rep.', 'Malta', 'Spain', 'Singapore', 'Italy', 'Israel', 'Monaco', 'San Marino', 'New Zealand', 'Andorra', 'United Kingdom', 'Finland', 'Belgium', 'Japan', 'France', 'Canada', 'Austria', 'Germany', 'Netherlands', 'Ireland', 'Australia', 'Iceland', 'Denmark', 'Sweden', 'Luxembourg', 'Norway', 'Switzerland', 'United States', 'World'], 'expense': [0.043835616438356165, 0.049315068493150684, 0.052054794520547946, 0.057534246575342465, 0.057534246575342465, 0.06301369863013699, 0.06575342465753424, 0.07671232876712329, 0.0821917808219178, 0.0821917808219178, 0.0821917808219178, 0.0821917808219178, 0.08767123287671233, 0.09315068493150686, 0.09863013698630137, 0.10136986301369863, 0.10410958904109589, 0.10410958904109589, 0.10684931506849316, 0.10684931506849316, 0.1095890410958904, 0.11232876712328767, 0.1232876712328767, 0.12876712328767123, 0.13150684931506848, 0.14520547945205478, 0.1506849315068493, 0.1506849315068493, 0.15342465753424658, 0.15616438356164383, 0.15616438356164383, 0.16164383561643836, 0.16986301369863013, 0.1726027397260274, 0.17534246575342466, 0.18082191780821918, 0.18082191780821918, 0.1863013698630137, 0.1863013698630137, 0.1863013698630137, 0.1917808219178082, 0.1917808219178082, 0.19726027397260273, 0.2, 0.2136986301369863, 0.21643835616438356, 0.2191780821917808, 0.2356164383561644, 0.2356164383561644, 0.2493150684931507, 0.25753424657534246, 0.2602739726027397, 0.2876712328767123, 0.29041095890410956, 0.3013698630136986, 0.30684931506849317, 0.336986301369863, 0.35342465753424657, 0.3589041095890411, 0.3698630136986301, 0.3863013698630137, 0.3863013698630137, 
0.41643835616438357, 0.4191780821917808, 0.4191780821917808, 0.43561643835616437, 0.4684931506849315, 0.4684931506849315, 0.4931506849315068, 0.5150684931506849, 0.5150684931506849, 0.5260273972602739, 0.547945205479452, 0.5561643835616439, 0.5835616438356165, 0.6027397260273972, 0.6054794520547945, 0.6082191780821918, 0.6136986301369863, 0.6219178082191781, 0.6602739726027397, 0.684931506849315, 0.7013698630136986, 0.7123287671232876, 0.7178082191780822, 0.7342465753424657, 0.7452054794520548, 0.7698630136986301, 0.8054794520547945, 0.810958904109589, 0.8328767123287671, 0.8438356164383561, 0.8575342465753425, 0.8657534246575342, 0.8712328767123287, 0.8958904109589041, 0.8986301369863013, 0.9315068493150684, 0.9753424657534246, 0.9835616438356164, 0.9917808219178083, 1.0410958904109588, 1.0602739726027397, 1.0904109589041096, 1.104109589041096, 1.1342465753424658, 1.1369863013698631, 1.1479452054794521, 1.158904109589041, 1.1726027397260275, 1.2164383561643837, 1.2657534246575342, 1.284931506849315, 1.284931506849315, 1.3041095890410959, 1.3424657534246576, 1.3534246575342466, 1.3835616438356164, 1.389041095890411, 1.4136986301369863, 1.4575342465753425, 1.515068493150685, 1.6356164383561644, 1.6767123287671233, 1.7068493150684931, 1.7287671232876711, 1.7753424657534247, 1.8136986301369864, 2.2164383561643834, 2.3315068493150686, 2.3945205479452056, 2.421917808219178, 2.4356164383561643, 2.5506849315068494, 2.5835616438356164, 2.6164383561643834, 2.66027397260274, 2.706849315068493, 2.7726027397260276, 2.7835616438356166, 2.852054794520548, 2.871232876712329, 2.915068493150685, 2.926027397260274, 3.010958904109589, 3.1424657534246574, 3.1890410958904107, 3.23013698630137, 3.2465753424657535, 3.263013698630137, 3.621917808219178, 3.6246575342465754, 3.778082191780822, 4.13972602739726, 4.323287671232877, 4.476712328767123, 4.586301369863014, 4.934246575342466, 5.005479452054795, 5.024657534246575, 5.027397260273973, 5.6, 6.3780821917808215, 6.5479452054794525, 
6.745205479452054, 7.504109589041096, 7.772602739726027, 8.054794520547945, 8.254794520547945, 10.26027397260274, 10.506849315068493, 10.843835616438357, 11.27945205479452, 11.367123287671232, 11.597260273972603, 11.67945205479452, 12.213698630136987, 12.843835616438357, 12.915068493150685, 12.991780821917809, 13.038356164383561, 13.704109589041096, 13.873972602739727, 15.24931506849315, 15.646575342465754, 17.18082191780822, 20.487671232876714, 26.947945205479453, 27.041095890410958, 2.8109589041095893]}

Next, we need a pandas DataFrame. Creating one from this dictionary is very simple and straightforward.
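One way to build a pandas DataFrame from a dictionary like Data is to pass it straight to pd.DataFrame. The sketch below uses three rows copied from the printed output so that it runs on its own; the variable names sample_data and df are my own choice.

```python
import pandas as pd

# Three rows copied from the printed dictionary; the full Data dict works the same way.
sample_data = {
    "country": ["Central African Republic", "Burundi", "United States"],
    "expense": [0.043835616438356165, 0.049315068493150684, 27.041095890410958],
}

# Each dictionary key becomes a column; columns= fixes the column order.
df = pd.DataFrame(sample_data, columns=["country", "expense"])
print(df)
```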

This project is typical of data science work in that the majority of the time was spent collecting and formatting the data. However, now that we have a clean dataset, we get to make some plots! We can use matplotlib, which we installed earlier, to visualize the data (seaborn would work too, but it is not needed here).

If we aren’t too concerned about aesthetics, we can use the built-in dataframe plot method to quickly show results:
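A minimal sketch of such a quick plot, assuming a DataFrame with country and expense columns. The three rows are copied from the printed output, and the file name expense_per_day.png is a placeholder.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Small stand-in for the full dataset so the example runs on its own.
sample_df = pd.DataFrame({
    "country": ["Central African Republic", "Burundi", "United States"],
    "expense": [0.043835616438356165, 0.049315068493150684, 27.041095890410958],
})

# DataFrame.plot.bar draws one bar per row and returns the matplotlib Axes.
ax = sample_df.plot.bar(x="country", y="expense", legend=False, figsize=(8, 4))
ax.set_ylabel("Health expenditure per person per day (US$)")

plt.tight_layout()
plt.savefig("expense_per_day.png")  # or plt.show() in an interactive session
```

With all 190 rows the labels get crowded, so a larger figsize or a horizontal bar chart (plot.barh) may be easier to read.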

I know the country names in the plot are pretty small, but you can download the data and analyze it yourself. The main thing to notice is that many countries spend far less than a dollar per person per day on healthcare, which is pretty shocking. The majority of these countries are in Asia and Africa. In my opinion, the WHO should focus more on these countries than on developed countries in the West.
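The "less than a dollar" observation can be checked with a simple filter. This sketch uses five rows copied from the printed output; on the full dataset the same two lines would apply, with the count depending on the data you scraped.

```python
import pandas as pd

# Five rows copied from the printed dictionary, including the "World" aggregate.
sample_df = pd.DataFrame({
    "country": ["Central African Republic", "Burundi", "Switzerland",
                "United States", "World"],
    "expense": [0.043835616438356165, 0.049315068493150684,
                26.947945205479453, 27.041095890410958, 2.8109589041095893],
})

# "World" is an aggregate rather than a country, so leave it out of the count.
countries_only = sample_df[sample_df["country"] != "World"]

# Boolean indexing keeps only the rows below one dollar per day.
below_one = countries_only[countries_only["expense"] < 1]

print(len(below_one), "of", len(countries_only), "countries are below $1 per day")
print(below_one["country"].tolist())
```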

This is not necessarily a publication-worthy plot, but it’s a nice way to wrap up a small project.

The most effective way to learn technical skills is by doing. While this whole project could have been done by manually entering values into Excel, I like to take the long view and think about how the skills learned here will help in the future. The process of learning is more important than the final result, and in this project, we were able to see how to use three critical skills for data science:

1. Web Scraping: Retrieving online data
2. BeautifulSoup: Parsing our data to extract information
3. Visualization: Showcasing all our hard work

Now, get out there and start your own project and remember: it doesn’t have to be world-changing to be worthwhile.

At this point, you should feel comfortable writing your first web scraper to gather data from a website.
