•  


?? Speed up `_get_child_links_recursive()` by 45% in `embedchain/loaders/docs_site_loader.py` by misrasaurabh1 · Pull Request #1266 · embedchain/embedchain · GitHub
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement . We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

?? Speed up _get_child_links_recursive() by 45% in embedchain/loaders/docs_site_loader.py #1266

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

misrasaurabh1
Copy link
Contributor

Description

?? _get_child_links_recursive() in embedchain/loaders/docs_site_loader.py

?? Performance went up by 45% ( 0.45x faster )

?? Runtime went down from 1489.11μs to 1029.61μs

Explanation and details

(click to show)

To improve the performance of this program, you could apply a few changes:

  1. Add a check at the start of the _get_child_links_recursive() to return early if the url has already been visited. This avoids the overhead of starting an unnecessary HTTP request which is expensive in terms of time.

  2. Use soup.find_all("a", href=True) to only select anchors with href attribute, this will reduce computation time.

  3. Use a generator expression instead of list comprehension for child_links and absolute_paths to save memory by generating items one at a time, instead of creating the entire list in memory.

  4. Instead of looping through absolute_paths and adding links to visited_links inside function _get_child_links_recursive , we can use set operation to add all links at once which would be faster.

  5. At the end, instead of calling the recursive function one link at a time, we can use a list comprehension to start the recursion for all links.

Type of change

Please delete options that are not relevant.

  • Refactor (does not change functionality, e.g. code style improvements, linting)

How Has This Been Tested?

The new optimized code was tested for correctness. The results are listed below.

  • Test Script (please provide)

? 4 Passed ? ?? Generated Regression Tests

(click to show generated tests)
# imports

import
 pytest
  # used for our unit tests

from
 unittest
.
mock
 import
 patch
, 
MagicMock

from
 urllib
.
parse
 import
 urlparse
, 
urljoin

from
 bs4
 import
 BeautifulSoup

import
 requests

import
 logging


# Assuming BaseLoader is defined somewhere else, we import or define a mock here

class
 BaseLoader
:
    
pass


# function to test

# The function is part of the DocsSiteLoader class, so we don't need to redefine it here.


# unit tests

class
 TestDocsSiteLoader
:
    
@
patch
(
'requests.get'
)

    def
 test_valid_url_with_child_links
(
self
, 
mock_get
):
        
"""Test a URL that has valid child links."""

        # Setup mock response with an HTML page containing child links

        html_content
 =
 '<html><body><a href="/child1">Child 1</a><a href="/child2">Child 2</a></body></html>'

        mock_response
 =
 MagicMock
(
status_code
=
200
, 
text
=
html_content
)
        
mock_get
.
return_value
 =
 mock_response


        loader
 =
 DocsSiteLoader
()
        
loader
.
_get_child_links_recursive
(
'http://example.com'
)

        
# Check that visited_links contains the correct absolute URLs

        assert
 'http://example.com/child1'
 in
 loader
.
visited_links

        assert
 'http://example.com/child2'
 in
 loader
.
visited_links


    @
patch
(
'requests.get'
)

    def
 test_valid_url_no_child_links
(
self
, 
mock_get
):
        
"""Test a URL that has no child links."""

        # Setup mock response with an HTML page without child links

        html_content
 =
 '<html><body>No links here!</body></html>'

        mock_response
 =
 MagicMock
(
status_code
=
200
, 
text
=
html_content
)
        
mock_get
.
return_value
 =
 mock_response


        loader
 =
 DocsSiteLoader
()
        
loader
.
_get_child_links_recursive
(
'http://example.com'
)

        
# Check that visited_links only contains the original URL

        assert
 loader
.
visited_links
 ==
 set
()

    
@
patch
(
'requests.get'
)

    def
 test_non_200_status_code
(
self
, 
mock_get
):
        
"""Test a URL that returns a non-200 status code."""

        # Setup mock response with a 404 status code

        mock_response
 =
 MagicMock
(
status_code
=
404
)
        
mock_get
.
return_value
 =
 mock_response


        loader
 =
 DocsSiteLoader
()
        
loader
.
_get_child_links_recursive
(
'http://example.com'
)

        
# Check that visited_links is empty

        assert
 loader
.
visited_links
 ==
 set
()

    
@
patch
(
'requests.get'
)

    def
 test_recursive_behavior
(
self
, 
mock_get
):
        
"""Test recursive behavior with a page that links to itself."""

        # Setup mock response with an HTML page that links to itself

        html_content
 =
 '<html><body><a href="/">Home</a></body></html>'

        mock_response
 =
 MagicMock
(
status_code
=
200
, 
text
=
html_content
)
        
mock_get
.
return_value
 =
 mock_response


        loader
 =
 DocsSiteLoader
()
        
loader
.
_get_child_links_recursive
(
'http://example.com'
)

        
# Check that visited_links contains only the original URL

        assert
 'http://example.com/'
 in
 loader
.
visited_links

        assert
 len
(
loader
.
visited_links
) 
==
 1


# Additional tests would be needed to cover other scenarios, such as redirects, invalid HTML, network issues, etc.

# However, as per the requirement not to mock or stub dependencies, these tests would not be possible without an actual server setup.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Made sure Checks passed

@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Feb 16, 2024
Copy link

codecov bot commented Feb 20, 2024

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base ( 2985b66 ) 56.60% compared to head ( 9ce0515 ) 56.76%.
Report is 22 commits behind head on main.

Files Patch % Lines
embedchain/loaders/docs_site_loader.py 85.71% 1 Missing ??
Additional details and impacted files
@@            Coverage Diff             @@

#
#             main    #1266      +/-   ##

==========================================
+
 Coverage   56.60%   56.76%   +0.15%     

==========================================
  Files         146      150       +4     
  Lines        5923     6067     +144     
==========================================
+
 Hits         3353     3444      +91     

-
 Misses       2570     2623      +53     

? View full report in Codecov by Sentry .
?? Have feedback on the report? Share it here .

Sign up for free to join this conversation on GitHub . Already have an account? Sign in to comment
Labels
size:S This PR changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant
- "漢字路" 한글한자자동변환 서비스는 교육부 고전문헌국역지원사업의 지원으로 구축되었습니다.
- "漢字路" 한글한자자동변환 서비스는 전통문화연구회 "울산대학교한국어처리연구실 옥철영(IT융합전공)교수팀"에서 개발한 한글한자자동변환기를 바탕하여 지속적으로 공동 연구 개발하고 있는 서비스입니다.
- 현재 고유명사(인명, 지명등)을 비롯한 여러 변환오류가 있으며 이를 해결하고자 많은 연구 개발을 진행하고자 하고 있습니다. 이를 인지하시고 다른 곳에서 인용시 한자 변환 결과를 한번 더 검토하시고 사용해 주시기 바랍니다.
- 변환오류 및 건의,문의사항은 juntong@juntong.or.kr로 메일로 보내주시면 감사하겠습니다. .
Copyright ⓒ 2020 By '전통문화연구회(傳統文化硏究會)' All Rights reserved.
 한국   대만   중국   일본