The cleaning stage described in this study not only refers to identifying and removing duplicate records but also involves identifying and removing several records that may not contribute significantly to the network. The ESRI Shapefiles used in our study contain polylines representing road segments. A polyline in our ESRI Shapefile is considered inadequate when,
Using this strategy, we can remove a considerable number of road segments from our graph without distorting the actual structure of the graph. For instance, the largest network we used in our study had 15,257 road segments. After removing the polylines complying with the aforementioned conditions, we had 7997 (only 52.4 % of the original) road segments without any considerable difference between the networks, as shown in
.
Geographic feature extraction
A graph is defined as a collection of vertices (nodes) and edges that connects a pair of vertices. Mathematically, a graph is defined as a pair of sets,
G
= (V, E), where V is a set of vertices and E is a set of Edges. Both NetworkX
[4]
and OSMnx
[5]
require the data in the same format to create a graphical representation of a network. In contrast to OSMnx
[5]
NetworkX
[4]
requires the user to manually extract the set of vertices and edges from the graph. While OSMnx
[5]
provides a convenient feature for directly extracting the required set of vertices and edges from the graph, this can lead to potential data quality issues. For our study, we extracted our data separately using python package Shapely and commonly used string manipulation techniques. Steps required to extract the unique set of vertices and edges are as followed:
-
1.
Extract the coordinates of the first node of each polyline (Using Shapely
[9]
).
-
2.
Extract the coordinates of the last node of each polyline (Using Shapely
[9]
).
-
3.
Using the
intersection
property provided by python package Shapely
[9]
, compute the polylines that share an intersecting geometry with other polylines and create a set for edges. As stated in the data cleaning strategy, two geometries sharing multiple intersections are considered invalid.
-
4.
Extract the coordinates of all intersection nodes. It must be noted that a single polyline can share geometries with multiple unique polylines at the same or different intersection point.
-
5.
Calculate the weight (distance) for each edge by first splitting the original polyline into two parts at the intersection node, and then calculate the length of both parts using the
split
and
length
property respectively, provided by Shapely
[9]
.
-
6.
At this point we now have two sets:
-
a.
Set of nodes containing first, last, and intersection nodes.
-
b.
Set of edges containing weights between two nodes.
-
7.
Remove any duplicates from both sets.
(a) and (b) display the Shapefile Representation of Texas after cleaning and the graphical representation of Texas using NetworkX, respectively. Data cleaning steps described in Section 1.2.3 reduce the network size by 48 % but still the graphical representation obtained consists of 10,858 unique nodes and 24,407 edges.
illustrates the potential and reproducibility of our strategy through graphical representations of three unique networks of varying sizes.
(a) shows an ESRI Shapefile depiction of Louisiana, which includes 9151 roadways. We were able to minimize the network size to 4616 roads, or approximately 50 % of the original size, by using the data cleaning procedure outlined in Section 1.2.3. Oklahoma's network, represented in
(c), was the smallest of the networks investigated in our study, with 5838 roadways. Following the data cleaning, only 2743 roads (about 48 %) were deemed appropriate for investigation. The graphical representation, shown in
(d), included 3344 distinct nodes and 6889 edges. Furthermore,
(e) and 4(f) show the Arkansas network's ESRI Shapefile and graphical representation, respectively. The data cleaning approach mentioned in Section 1.2.3 reduced the network size from 5932 roads to 3012 roads (about 51 %), as in the prior situations. The finished graphical network had 4368 distinct nodes and 8966 edges.
Conclusion
In this study, we proposed a method to create complex real-world networks using ESRI Shapefiles. The proposed method successfully resolves the issue of transforming ESRI Shapefile data into the necessary format for well-known graph analysis libraries like OSMnx and NetworkX. The proposed method enables researchers to quickly create graphical networks for their research projects, allowing them to test their theories on actual networks. The method's simplified data format avoids the need for time-consuming data preprocessing, allowing researchers to focus more on the analysis itself. The data cleaning strategy described in this study not only reduces resource consumption but also enhances the user's understanding of graph networks. As proof of its efficacy, the proposed method was successfully used to convert the ESRI Shapefile data from Texas, Louisiana, Arkansas, and Oklahoma into the graphical representations of the networks. As a result, this study significantly advances the field by empowering other researchers to evaluate the efficacy of graph algorithms.
Credit author statement
Harish:
Conceptualization, formal analysis, investigation, data curation, software, methodolog, visualization, writing – original draft, writing – review and editing.
Peter Mooney:
Supervision, writing – original draft, writing – review & editing.
Edgar Galván:
Supervision, writing – original draft, writing – review & editing. All authors have agreed to the publication of this manuscript.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the Department of Computer Science, Maynooth University, Ireland. The authors wish to acknowledge the DJEI/DES/SFI/HEA Irish Centre for High-End Computing (ICHEC) for the provision of computational facilities and support.
References
1.
Dijkstra E.W. A note on two problems in connexion with graphs.
Numer. Math.
1959;
1
:269–271.
[
Google Scholar
]
2.
Bellman R. On a routing problem.
Q. Appl. Math.
1958;
16
(1):87–90.
[
Google Scholar
]
3.
Tarjan R. Depth-first search and linear graph algorithms.
SIAM J. Comput.
1972;
1
(2):146–160.
[
Google Scholar
]
4.
Hagberg A., Swart P., S Chult D. Los Alamos National Lab.(LANL); Los Alamos, NM (United States): 2008.
Exploring Network structure, dynamics, and Function Using NetworkX
(No. LA-UR-08-05495; LA-UR-08-5495)
[
Google Scholar
]
5.
Boeing G. OSMnx: new methods for acquiring, constructing, analyzing, and visualizing complex street networks.
Comput. Environ. Urban Syst.
2017;
65
:126–139.
[
Google Scholar
]
6.
Hunter J.D. Matplotlib: a 2D graphics environment.
Comput. Sci. Eng.
2007;
9
(03):90–95.
[
Google Scholar
]
7.
McKinney, W., & others. (2020). pandas: powerful data analysis toolkit (Version 1.3.0) [Software]. Zenodo. Retrieved from
https://pandas.pydata.org/
.
8.
Jordahl K., Van den Bossche J., Fleischmann M., Wasserman J., McBride J., Gerard J., Tratner J., Perry M., Garcia Badaracco A., Farmer C., Hjelle G.A. geopandas/geopandas: v0. 8.1.
Zenodo.
2020
[
Google Scholar
]
9.
Westra E. Packt Publishing Ltd; 2015. Python Geospatial Analysis Essentials.
[
Google Scholar
]