JSON Data Web Scraping

24 November 2023

Category: Articles

Modified: 24.11.2023


Topic: IR IS

Part 1 of 2.

Next article in this topic:

  1. Time Series: Importing & Cleaning Data


While working on the monograph in 2021, I analyzed data on the information systems and resources registered in Belarus. Some time has passed since then, and in this blog I will repeat the analysis, this time using new approaches to both the analysis itself and the visualization of the results.

The analysis will span a series of articles, and this article is the first of them.

The first difficulty we faced is that the data is presented in a way that is convenient for users, but not for analysts. In particular, you can add or remove columns and search the registry, but there is no convenient way to download all the records. Since the records arrive as JSON rather than HTML, conventional web scraping techniques are of little use here; for example, the Beautiful Soup library is almost useless.

In practice, however, the JSON format is even more convenient: the browser loads all the records in a single structured response, which we can request directly.
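
For illustration, the response body has roughly the following shape, with the actual records sitting under the 'rows' key (the snippet below is a schematic sketch, not a verbatim response):

    {"rows": [{"numberOnRegistration": "...", "fullNameIs": "...", "dateOnRegistration": "..."},
              {"numberOnRegistration": "...", "fullNameIs": "...", "dateOnRegistration": "..."}]}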

Here and hereafter, we perform the analysis using Python in a Jupyter Notebook.

First, import the necessary libraries.

import pandas as pd
import requests
import json
import os

Information Systems

Let’s collect the request details from the Chrome browser so that we can reproduce the query programmatically.

# in the Chrome browser, look here: inspector (Network -> Fetch/XHR -> Headers and Response)
endpoint_sys = 'http://xn--c1akxf.xn--90ais/api/systemRegister/list'
header_sys = {'Accept': 'application/json, text/plain, */*',
          'Accept-Language': 'en-US,en;q=0.9',
          'Connection': 'keep-alive',
          'Content-Type': 'application/json; charset=UTF-8',
          'Origin': 'http://xn--c1akxf.xn--90ais',
          'Referer': 'http://xn--c1akxf.xn--90ais/app/registerIS',
          'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) '
                        'Chrome/114.0.0.0 Safari/537.36'}

# build the query based on the Chrome inspector data (copy as cURL); we only need a few lines of code
query_sys = '{"page":1,"rows":-1}' # a plain string needs no doubled "{" brackets, unlike an f-string
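
As a side note, the same payload can be built from an ordinary dictionary, which sidesteps any bracket escaping altogether:

# the same payload built from a dict; json.dumps produces an equivalent JSON string
query_sys = json.dumps({'page': 1, 'rows': -1})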

Let’s move on to direct data loading.

# load data
response_sys = requests.post(endpoint_sys, headers=header_sys, data=query_sys)
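# optional sanity check: raise an exception early if the request failed at the HTTP level
response_sys.raise_for_status()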

# in the response we see that the 'rows' key holds the actual records, so we parse only that part
parsed_sys = json.loads(response_sys.content)['rows']

# convert JSON data into DataFrame
inf_sys = pd.DataFrame.from_dict(data=parsed_sys)

# list the required columns
columns_sys = ['numberOnRegistration', 'dateOnRegistration', 'dateActyalization',\
               'dateExclude', 'stateNotstateName', 'fullNameIs', 'shortNameIs',\
               'appointmentIs', 'functionIs', 'nameViewsIs', 'nameViewsStructures',\
               'nameSizeIs', 'clientsName', 'operatorsName', 'ownersName',\
               'developersName', 'proprietorsName']

# limit the DataFrame to the required columns
inf_sys = inf_sys[columns_sys]

# convert dates to datetime objects
date_sys = ['dateOnRegistration', 'dateActyalization', 'dateExclude']
inf_sys[date_sys] = inf_sys[date_sys].apply(pd.to_datetime, format='%d.%m.%Y')

print(inf_sys.info()) # display information about the DataFrame

Information about the resulting DataFrame is shown below.

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 435 entries, 0 to 434
    Data columns (total 17 columns):
     #   Column                Non-Null Count  Dtype         
    ---  ------                --------------  -----         
     0   numberOnRegistration  435 non-null    object        
     1   dateOnRegistration    435 non-null    datetime64[ns]
     2   dateActyalization     111 non-null    datetime64[ns]
     3   dateExclude           38 non-null     datetime64[ns]
     4   stateNotstateName     435 non-null    object        
     5   fullNameIs            435 non-null    object        
     6   shortNameIs           435 non-null    object        
     7   appointmentIs         435 non-null    object        
     8   functionIs            435 non-null    object        
     9   nameViewsIs           433 non-null    object        
     10  nameViewsStructures   434 non-null    object        
     11  nameSizeIs            433 non-null    object        
     12  clientsName           417 non-null    object        
     13  operatorsName         356 non-null    object        
     14  ownersName            434 non-null    object        
     15  developersName        434 non-null    object        
     16  proprietorsName       435 non-null    object        
    dtypes: datetime64[ns](3), object(14)
    memory usage: 57.9+ KB
    None

Save the DataFrame to a file so that we do not have to download the data again.

# save to file
os.makedirs('data', exist_ok=True)
inf_sys.to_csv('data/inf_sys.csv') 

In the following articles, we will start the analysis by downloading this file.
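
For reference, reading the file back might look like this (a minimal sketch; note that to_csv above also wrote the index as the first column):

# read the saved file back, restoring the index and the datetime columns
inf_sys = pd.read_csv('data/inf_sys.csv', index_col=0,
                      parse_dates=['dateOnRegistration', 'dateActyalization', 'dateExclude'])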

Information Resources

In Belarus, according to the legislation, information resources are subject to registration in addition to information systems. We therefore need to repeat the operations above, changing only the web address.

# repeat previous steps
endpoint_res = 'http://xn--c1akxf.xn--90ais/api/resourceRegister/list'
header_res = {'Accept': 'application/json, text/plain, */*',
          'Accept-Language': 'en-US,en;q=0.9',
          'Connection': 'keep-alive',
          'Content-Type': 'application/json; charset=UTF-8',
          'Origin': 'http://xn--c1akxf.xn--90ais',
          'Referer': 'http://xn--c1akxf.xn--90ais/app/registerIR',
          'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) '
                        'Chrome/114.0.0.0 Safari/537.36'}
query_res = '{"page":1,"rows":-1}'

Several hundred information systems are registered, but there are more than 35,000 resources, so the download can take several minutes: the response amounts to more than a gigabyte of data.

response_res = requests.post(endpoint_res, headers=header_res, data=query_res)
parsed_res = json.loads(response_res.content)['rows']
inf_res = pd.DataFrame.from_dict(data=parsed_res)
columns_res = ['numberOnRegistration', 'dateOnRegistration', 'dateActualization',\
               'dateExclude', 'fullNameSource', 'shortNameSource', 'dbDepart', 'rtypeName',\
               'themesName', 'rubricName', 'content', 'dbSizeMb', 'dbSizeRec', 'dbLang',\
               'dbCreate', 'dbRetroBeg', 'dbRetroEnd', 'datasource', 'niokrFlag', 'niokrName',\
               'bePart', 'relationFlag', 'relation', 'programmShell', 'safetyRequirements',\
               'dtiBaseInfo', 'rstatus', 'dataChangeRstatus', 'dbDelivery', 'ownerName',\
               'developersName', 'placesName', 'urlsName', 'serviceName', 'securitiesName',\
               'informationObjectsInfo']
inf_res = inf_res[columns_res]
date_res = ['dateOnRegistration', 'dateActualization', 'dateExclude', 'dtiBaseInfo',\
            'dataChangeRstatus']
inf_res[date_res] = inf_res[date_res].apply(pd.to_datetime, format='%d.%m.%Y')
print(inf_res.info())
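
To see how much data actually came over the wire, we can check the size of the raw response (an optional check, not part of the original pipeline):

# rough size of the raw JSON response in megabytes
print(f'{len(response_res.content) / 1024 ** 2:.1f} MB')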

Details about the DataFrame are shown below.

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 36442 entries, 0 to 36441
    Data columns (total 36 columns):
     #   Column                  Non-Null Count  Dtype         
    ---  ------                  --------------  -----         
     0   numberOnRegistration    36442 non-null  object        
     1   dateOnRegistration      36442 non-null  datetime64[ns]
     2   dateActualization       5975 non-null   datetime64[ns]
     3   dateExclude             2534 non-null   datetime64[ns]
     4   fullNameSource          36442 non-null  object        
     5   shortNameSource         36442 non-null  object        
     6   dbDepart                33664 non-null  object        
     7   rtypeName               36442 non-null  object        
     8   themesName              36442 non-null  object        
     9   rubricName              36442 non-null  object        
     10  content                 36436 non-null  object        
     11  dbSizeMb                36183 non-null  object        
     12  dbSizeRec               1656 non-null   object        
     13  dbLang                  36396 non-null  object        
     14  dbCreate                36420 non-null  object        
     15  dbRetroBeg              33881 non-null  object        
     16  dbRetroEnd              16794 non-null  object        
     17  datasource              29602 non-null  object        
     18  niokrFlag               35994 non-null  float64       
     19  niokrName               1046 non-null   object        
     20  bePart                  1356 non-null   object        
     21  relationFlag            35936 non-null  float64       
     22  relation                1824 non-null   object        
     23  programmShell           36286 non-null  object        
     24  safetyRequirements      16775 non-null  object        
     25  dtiBaseInfo             36382 non-null  datetime64[ns]
     26  rstatus                 36442 non-null  object        
     27  dataChangeRstatus       36442 non-null  datetime64[ns]
     28  dbDelivery              16688 non-null  object        
     29  ownerName               36440 non-null  object        
     30  developersName          35898 non-null  object        
     31  placesName              32446 non-null  object        
     32  urlsName                12592 non-null  object        
     33  serviceName             32427 non-null  object        
     34  securitiesName          5274 non-null   object        
     35  informationObjectsInfo  839 non-null    object        
    dtypes: datetime64[ns](5), float64(2), object(29)
    memory usage: 10.0+ MB
    None

Save the resulting DataFrame to a file.

# save to file
os.makedirs('data', exist_ok=True)
inf_res.to_csv('data/inf_res.csv') 

The file turned out to be 82.8 megabytes, much smaller than the original response. The difference is explained by the fact that we did not keep the data that is not needed for the analysis, for example, the names and contact details of the employees of the organizations that entered the information into the register.

To sum up, we downloaded the publicly available data on registered information systems and resources. Along the way, we dropped the columns we do not need and converted the date columns to datetime objects. In the next article, we will continue cleaning the data more thoroughly and take a look at the most promising fields.


