?? Speed up `_get_child_links_recursive()` by 45% in `embedchain/loaders/docs_site_loader.py` #1266

misrasaurabh1 · 2024-02-16T11:50:24Z

Description

?? `_get_child_links_recursive()` in `embedchain/loaders/docs_site_loader.py`

?? Performance went up by 45% ( 0.45x faster )

?? Runtime went down from 1489.11μs to 1029.61μs

Explanation and details

(click to show)

To improve the performance of this program, you could apply a few changes:

Add a check at the start of the _get_child_links_recursive() to return early if the url has already been visited. This avoids the overhead of starting an unnecessary HTTP request which is expensive in terms of time.
Use soup.find_all("a", href=True) to only select anchors with href attribute, this will reduce computation time.
Use a generator expression instead of list comprehension for child_links and absolute_paths to save memory by generating items one at a time, instead of creating the entire list in memory.
Instead of looping through absolute_paths and adding links to visited_links inside function _get_child_links_recursive, we can use set operation to add all links at once which would be faster.
At the end, instead of calling the recursive function one link at a time, we can use a list comprehension to start the recursion for all links.

Type of change

Please delete options that are not relevant.

Refactor (does not change functionality, e.g. code style improvements, linting)

How Has This Been Tested?

The new optimized code was tested for correctness. The results are listed below.

Test Script (please provide)

? 4 Passed ? ?? Generated Regression Tests

(click to show generated tests)

# imports

import
 pytest
  # used for our unit tests

from
 unittest
.
mock
 import
 patch
, 
MagicMock

from
 urllib
.
parse
 import
 urlparse
, 
urljoin

from
 bs4
 import
 BeautifulSoup

import
 requests

import
 logging


# Assuming BaseLoader is defined somewhere else, we import or define a mock here

class
 BaseLoader
:
    
pass


# function to test

# The function is part of the DocsSiteLoader class, so we don't need to redefine it here.


# unit tests

class
 TestDocsSiteLoader
:
    
@
patch
(
'requests.get'
)

    def
 test_valid_url_with_child_links
(
self
, 
mock_get
):
        
"""Test a URL that has valid child links."""

        # Setup mock response with an HTML page containing child links

        html_content
 =
 '<html><body><a href="/child1">Child 1</a><a href="/child2">Child 2</a></body></html>'

        mock_response
 =
 MagicMock
(
status_code
=
200
, 
text
=
html_content
)
        
mock_get
.
return_value
 =
 mock_response


        loader
 =
 DocsSiteLoader
()
        
loader
.
_get_child_links_recursive
(
'http://example.com'
)

        
# Check that visited_links contains the correct absolute URLs

        assert
 'http://example.com/child1'
 in
 loader
.
visited_links

        assert
 'http://example.com/child2'
 in
 loader
.
visited_links


    @
patch
(
'requests.get'
)

    def
 test_valid_url_no_child_links
(
self
, 
mock_get
):
        
"""Test a URL that has no child links."""

        # Setup mock response with an HTML page without child links

        html_content
 =
 '<html><body>No links here!</body></html>'

        mock_response
 =
 MagicMock
(
status_code
=
200
, 
text
=
html_content
)
        
mock_get
.
return_value
 =
 mock_response


        loader
 =
 DocsSiteLoader
()
        
loader
.
_get_child_links_recursive
(
'http://example.com'
)

        
# Check that visited_links only contains the original URL

        assert
 loader
.
visited_links
 ==
 set
()

    
@
patch
(
'requests.get'
)

    def
 test_non_200_status_code
(
self
, 
mock_get
):
        
"""Test a URL that returns a non-200 status code."""

        # Setup mock response with a 404 status code

        mock_response
 =
 MagicMock
(
status_code
=
404
)
        
mock_get
.
return_value
 =
 mock_response


        loader
 =
 DocsSiteLoader
()
        
loader
.
_get_child_links_recursive
(
'http://example.com'
)

        
# Check that visited_links is empty

        assert
 loader
.
visited_links
 ==
 set
()

    
@
patch
(
'requests.get'
)

    def
 test_recursive_behavior
(
self
, 
mock_get
):
        
"""Test recursive behavior with a page that links to itself."""

        # Setup mock response with an HTML page that links to itself

        html_content
 =
 '<html><body><a href="/">Home</a></body></html>'

        mock_response
 =
 MagicMock
(
status_code
=
200
, 
text
=
html_content
)
        
mock_get
.
return_value
 =
 mock_response


        loader
 =
 DocsSiteLoader
()
        
loader
.
_get_child_links_recursive
(
'http://example.com'
)

        
# Check that visited_links contains only the original URL

        assert
 'http://example.com/'
 in
 loader
.
visited_links

        assert
 len
(
loader
.
visited_links
) 
==
 1


# Additional tests would be needed to cover other scenarios, such as redirects, invalid HTML, network issues, etc.

# However, as per the requirement not to mock or stub dependencies, these tests would not be possible without an actual server setup.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules
I have checked my code and corrected any misspellings

Maintainer Checklist

closes #xxxx (Replace xxxx with the GitHub issue number)
Made sure Checks passed

codecov · 2024-02-20T19:15:04Z

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base ( 2985b66) 56.60% compared to head ( 9ce0515) 56.76%.
Report is 22 commits behind head on main.

Files	Patch %	Lines
embedchain/loaders/docs_site_loader.py	85.71%	1 Missing ??

Additional details and impacted files

@@            Coverage Diff             @@

#
#             main    #1266      +/-   ##

==========================================
+
 Coverage   56.60%   56.76%   +0.15%     

==========================================
  Files         146      150       +4     
  Lines        5923     6067     +144     
==========================================
+
 Hits         3353     3444      +91     

-
 Misses       2570     2623      +53

? View full report in Codecov by Sentry .
?? Have feedback on the report? Share it here .

?? Speed up _get_child_links_recursive by 45%

345d34d

dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Feb 16, 2024

fix bug

9ce0515

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

?? Speed up `_get_child_links_recursive()` by 45% in `embedchain/loaders/docs_site_loader.py` #1266

?? Speed up `_get_child_links_recursive()` by 45% in `embedchain/loaders/docs_site_loader.py` #1266

misrasaurabh1 commented Feb 16, 2024

codecov bot commented Feb 20, 2024

?? Speed up _get_child_links_recursive() by 45% in embedchain/loaders/docs_site_loader.py #1266

Are you sure you want to change the base?

?? Speed up _get_child_links_recursive() by 45% in embedchain/loaders/docs_site_loader.py #1266

Conversation

misrasaurabh1 commented Feb 16, 2024

Description

?? _get_child_links_recursive() in embedchain/loaders/docs_site_loader.py

Explanation and details

Type of change

How Has This Been Tested?

? 4 Passed ? ?? Generated Regression Tests

Checklist:

Maintainer Checklist

codecov bot commented Feb 20, 2024

Codecov Report

?? Speed up `_get_child_links_recursive()` by 45% in `embedchain/loaders/docs_site_loader.py` #1266

?? Speed up `_get_child_links_recursive()` by 45% in `embedchain/loaders/docs_site_loader.py` #1266

?? `_get_child_links_recursive()` in `embedchain/loaders/docs_site_loader.py`