JSON Data Web Scraping
24 November 2023
Category: Articles
Modified: 24.11.2023
The topic: IR IS
Part 1 of 2.
While working on my monograph in 2021, I analyzed data on information systems and resources registered in Belarus. Some time has passed since then, so in this blog I will repeat the analysis, using new approaches to both the analysis itself and the visualization of the results.
The analysis will span a series of articles, of which this is the first.
The difficulty we face at the outset is that the data is presented in a way convenient for users, but not for analysts. In particular, one can add or remove columns and search the registry, but there is no convenient way to download all the records. And since the records are delivered as JSON, conventional web-scraping techniques do not apply: the Beautiful Soup library, for example, is almost useless here.
In practice, however, the JSON format is even more convenient, because the browser loads all the records in a single request, and we can reproduce that request ourselves.
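To illustrate the point, here is a self-contained sketch of what an HTML-oriented probe runs into; the page address is the one that later appears in the Referer header. Since the records are loaded into the page by JavaScript, the static HTML carries no data for Beautiful Soup to find.
# sketch: probing the registry page with Beautiful Soup finds no data,
# because the records arrive later via an XHR request in JSON form
import requests
from bs4 import BeautifulSoup

page = requests.get('http://xn--c1akxf.xn--90ais/app/registerIS')
soup = BeautifulSoup(page.content, 'html.parser')
print(len(soup.find_all('table')))  # expected 0: the tables are built client-side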
Here and hereafter, we perform the analysis using Python in a Jupyter Notebook.
First, import the necessary libraries.
import pandas as pd
import requests
import json
import os
Information Systems
Let’s collect the request details from the Chrome browser so that we can reproduce the connection programmatically.
# in the Chrome browser, look here: inspector (Network -> Fetch/XHR -> Headers and Response)
endpoint_sys = 'http://xn--c1akxf.xn--90ais/api/systemRegister/list'
header_sys = {'Accept': 'application/json, text/plain, */*',
              'Accept-Language': 'en-US,en;q=0.9',
              'Connection': 'keep-alive',
              'Content-Type': 'application/json; charset=UTF-8',
              'Origin': 'http://xn--c1akxf.xn--90ais',
              'Referer': 'http://xn--c1akxf.xn--90ais/app/registerIS',
              'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                            'AppleWebKit/537.36 (KHTML, like Gecko) '
                            'Chrome/114.0.0.0 Safari/537.36'}
# build the query from the Chrome inspector data (copy as cURL) -- only a few lines are needed
query_sys = json.dumps({'page': 1, 'rows': -1})  # rows=-1 asks the API for all records at once
Now let’s load the data itself.
# load data
response_sys = requests.post(endpoint_sys, headers=header_sys, data=query_sys)
# in the response, the actual records live under the 'rows' key
parsed_sys = json.loads(response_sys.content)['rows']
# convert the JSON records into a DataFrame
inf_sys = pd.DataFrame.from_dict(data=parsed_sys)
# list the required columns
columns_sys = ['numberOnRegistration', 'dateOnRegistration', 'dateActyalization',
               'dateExclude', 'stateNotstateName', 'fullNameIs', 'shortNameIs',
               'appointmentIs', 'functionIs', 'nameViewsIs', 'nameViewsStructures',
               'nameSizeIs', 'clientsName', 'operatorsName', 'ownersName',
               'developersName', 'proprietorsName']
# limit the DataFrame to the required columns
inf_sys = inf_sys[columns_sys]
# convert dates to datetime objects
date_sys = ['dateOnRegistration', 'dateActyalization', 'dateExclude']
inf_sys[date_sys] = inf_sys[date_sys].apply(pd.to_datetime, format='%d.%m.%Y')
print(inf_sys.info()) # display information about the DataFrame
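One note on robustness: the request above assumes the server always answers successfully. A slightly more defensive variant is sketched below; the timeout value is my own choice, not part of the original recipe.
# defensive variant (sketch): bound the wait and fail early on HTTP errors
response_sys = requests.post(endpoint_sys, headers=header_sys,
                             data=query_sys, timeout=60)
response_sys.raise_for_status()           # raises on 4xx/5xx status codes
parsed_sys = response_sys.json()['rows']  # requests can decode the JSON directly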
Information about the resulting DataFrame is shown below.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435 entries, 0 to 434
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 numberOnRegistration 435 non-null object
1 dateOnRegistration 435 non-null datetime64[ns]
2 dateActyalization 111 non-null datetime64[ns]
3 dateExclude 38 non-null datetime64[ns]
4 stateNotstateName 435 non-null object
5 fullNameIs 435 non-null object
6 shortNameIs 435 non-null object
7 appointmentIs 435 non-null object
8 functionIs 435 non-null object
9 nameViewsIs 433 non-null object
10 nameViewsStructures 434 non-null object
11 nameSizeIs 433 non-null object
12 clientsName 417 non-null object
13 operatorsName 356 non-null object
14 ownersName 434 non-null object
15 developersName 434 non-null object
16 proprietorsName 435 non-null object
dtypes: datetime64[ns](3), object(14)
memory usage: 57.9+ KB
None
Save the DataFrame to a file so that we do not have to download the data again.
# save to file
os.makedirs('data', exist_ok=True)
inf_sys.to_csv('data/inf_sys.csv')
In the following articles, we will begin the analysis by loading this file.
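For reference, a minimal sketch of reading the file back: to_csv has written the index as the first column, and the dates have become plain strings, so both need to be restored.
# reload the saved CSV: skip the index column and parse the date columns back
inf_sys = pd.read_csv('data/inf_sys.csv', index_col=0,
                      parse_dates=['dateOnRegistration', 'dateActyalization',
                                   'dateExclude'])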
Information Resources
In Belarus, according to the legislation, information resources are subject to registration in addition to information systems. So we repeat the operations above, changing only the web address.
# repeat the previous steps, changing only the endpoint and the Referer
endpoint_res = 'http://xn--c1akxf.xn--90ais/api/resourceRegister/list'
header_res = {'Accept': 'application/json, text/plain, */*',
              'Accept-Language': 'en-US,en;q=0.9',
              'Connection': 'keep-alive',
              'Content-Type': 'application/json; charset=UTF-8',
              'Origin': 'http://xn--c1akxf.xn--90ais',
              'Referer': 'http://xn--c1akxf.xn--90ais/app/registerIR',
              'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                            'AppleWebKit/537.36 (KHTML, like Gecko) '
                            'Chrome/114.0.0.0 Safari/537.36'}
query_res = json.dumps({'page': 1, 'rows': -1})
Several hundred information systems are registered, but there are more than 35,000 resources, so the download can take several minutes: the response amounts to more than a gigabyte of data. If a single request of that size proves too heavy, it can be split into pages, as sketched below.
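A minimal sketch of such a paginated download, assuming the API honors positive page and rows values (the same fields we saw in the payload above); I have not verified this against the server.
# alternative (sketch): fetch the register page by page instead of one huge request;
# assumes the API honors positive 'page'/'rows' values, which is not verified here
records = []
page = 1
while True:
    query = json.dumps({'page': page, 'rows': 1000})  # 1000 rows per request, an arbitrary size
    resp = requests.post(endpoint_res, headers=header_res, data=query, timeout=120)
    resp.raise_for_status()
    rows = resp.json()['rows']
    if not rows:          # an empty page means we have everything
        break
    records.extend(rows)
    page += 1
inf_res = pd.DataFrame(records)
In what follows, though, we stick with the single request.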
response_res = requests.post(endpoint_res, headers=header_res, data=query_res)
parsed_res = json.loads(response_res.content)['rows']
inf_res = pd.DataFrame.from_dict(data=parsed_res)
columns_res = ['numberOnRegistration', 'dateOnRegistration', 'dateActualization',
               'dateExclude', 'fullNameSource', 'shortNameSource', 'dbDepart', 'rtypeName',
               'themesName', 'rubricName', 'content', 'dbSizeMb', 'dbSizeRec', 'dbLang',
               'dbCreate', 'dbRetroBeg', 'dbRetroEnd', 'datasource', 'niokrFlag', 'niokrName',
               'bePart', 'relationFlag', 'relation', 'programmShell', 'safetyRequirements',
               'dtiBaseInfo', 'rstatus', 'dataChangeRstatus', 'dbDelivery', 'ownerName',
               'developersName', 'placesName', 'urlsName', 'serviceName', 'securitiesName',
               'informationObjectsInfo']
inf_res = inf_res[columns_res]
date_res = ['dateOnRegistration', 'dateActualization', 'dateExclude', 'dtiBaseInfo',
            'dataChangeRstatus']
inf_res[date_res] = inf_res[date_res].apply(pd.to_datetime, format='%d.%m.%Y')
print(inf_res.info())
Details about the DataFrame are shown below.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36442 entries, 0 to 36441
Data columns (total 36 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 numberOnRegistration 36442 non-null object
1 dateOnRegistration 36442 non-null datetime64[ns]
2 dateActualization 5975 non-null datetime64[ns]
3 dateExclude 2534 non-null datetime64[ns]
4 fullNameSource 36442 non-null object
5 shortNameSource 36442 non-null object
6 dbDepart 33664 non-null object
7 rtypeName 36442 non-null object
8 themesName 36442 non-null object
9 rubricName 36442 non-null object
10 content 36436 non-null object
11 dbSizeMb 36183 non-null object
12 dbSizeRec 1656 non-null object
13 dbLang 36396 non-null object
14 dbCreate 36420 non-null object
15 dbRetroBeg 33881 non-null object
16 dbRetroEnd 16794 non-null object
17 datasource 29602 non-null object
18 niokrFlag 35994 non-null float64
19 niokrName 1046 non-null object
20 bePart 1356 non-null object
21 relationFlag 35936 non-null float64
22 relation 1824 non-null object
23 programmShell 36286 non-null object
24 safetyRequirements 16775 non-null object
25 dtiBaseInfo 36382 non-null datetime64[ns]
26 rstatus 36442 non-null object
27 dataChangeRstatus 36442 non-null datetime64[ns]
28 dbDelivery 16688 non-null object
29 ownerName 36440 non-null object
30 developersName 35898 non-null object
31 placesName 32446 non-null object
32 urlsName 12592 non-null object
33 serviceName 32427 non-null object
34 securitiesName 5274 non-null object
35 informationObjectsInfo 839 non-null object
dtypes: datetime64[ns](5), float64(2), object(29)
memory usage: 10.0+ MB
None
Save the resulting DataFrame to a file.
# save to file
os.makedirs('data', exist_ok=True)
inf_res.to_csv('data/inf_res.csv')
The file size turned out to be 82.8 megabytes, much smaller than the original response. The difference is explained by the fact that we did not keep the data unnecessary for the analysis: for example, the names and contact details of the employees of the organizations who entered the information into the register.
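If file size matters, a binary format is another option; a minimal sketch, assuming the pyarrow package is installed. Unlike CSV, Parquet also preserves the datetime columns, so no re-parsing is needed on load.
# alternative: Parquet compresses well and keeps the datetime dtypes (requires pyarrow)
inf_res.to_parquet('data/inf_res.parquet')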
To sum up, we have downloaded the publicly available data on registered information systems and resources for analysis. Along the way, we pre-cleaned the data, dropping the columns we do not need and converting the dates to proper datetime objects. In the next article, we will continue cleaning the data more thoroughly and look at the most promising parts of it.