When a requests-based crawler targets a website that requires login, you have to analyze data packets and JS source code, construct complex requests, and often deal with anti-scraping measures such as CAPTCHAs, JS obfuscation, and signature parameters, so the barrier to entry is high. If the data is computed by JS, the calculation has to be reproduced in your own code. The experience is poor and development is slow.
Selenium sidesteps most of these pitfalls, but selenium is not efficient. This library therefore combines selenium and requests into one, switches to the appropriate mode as needs change, and provides user-friendly methods to improve development and runtime efficiency.
In addition to merging the two, the library wraps common web page operations and simplifies selenium's operations and statements. For web page automation, this means fewer details to think about and more focus on implementing the feature itself.
The goal is to keep everything simple, offer direct usage wherever possible, and be friendly to novices.
Python 3.6 or above is required, and driver mode currently supports only Chrome.
To use driver mode, you must download Chrome and the **corresponding version** of chromedriver. [[chromedriver download]](https://chromedriver.chromium.org/downloads)
It has only been tested in the Windows environment.
If you choose the third method, please run these lines of code before using this library for the first time and record these two paths in the ini file.
- Different projects may require different versions of chrome and chromedriver. You can also save multiple ini files and use them as needed.
- It is recommended to use a portable version of chrome and set its path manually, so that browser upgrades do not cause a mismatch with the chromedriver version.
- It is recommended to set the debugger_address when debugging the project and use the manually opened browser to debug, saving time and effort.
Creating a Drission object yourself is optional. If you want to get started quickly, you can skip this section, because the MixPage object creates one automatically.
Drission objects are used to manage driver and session objects. When multiple pages work together, the Drission object is used to pass the driver, so that multiple page classes can control the same browser or Session object.
A Drission can be created directly from the configuration in the ini file, or the configuration can be passed in during initialization.
The MixPage page object encapsulates common web page operations and realizes the switch between driver and session modes.
MixPage must receive a Drission object and use the driver or session in it. If it is not passed in, MixPage will create a Drission by itself (using the configuration of the default ini file).
Tips: When multiple page objects work together, remember to create a Drission object manually and pass it to each page object. Otherwise, each page object creates its own Drission object, and information cannot be shared between them.
### Create Object
There are three ways to create objects: the simple way, passing in a Drission object, and passing in configuration. Choose whichever fits your needs.
```python
# Simple creation method, automatically create Drission objects with ini file default configuration
page = MixPage()
page = MixPage('s')
# Create by passing in the Drission object
page = MixPage(drission)
page = MixPage(drission, mode='s', timeout=5) # session mode, waiting time is 5 seconds (default 10 seconds)
# Create with incoming configuration information
page = MixPage(driver_options=DriverOption, session_options=SessionOption) # default d mode
```
### Visit website
If a connection error occurs, the program automatically retries twice. The number of retries and the waiting interval can be specified.
```python
# Default mode
page.get(url)
page.post(url, data, **kwargs) # Only session mode has post method
# Specify the number of retries and interval
page.get(url, retry=5, interval=0.5)
```
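The retry behaviour described above can be pictured with a small standalone sketch. It is not DrissionPage's implementation: `fetch_with_retry`, its injectable `sleep` parameter, and the 1-second default interval are assumptions made for illustration; only the default of two retries comes from the text.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(fetch: Callable[[], object],
                     retry: int = 2,
                     interval: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[object]:
    """Call fetch(); on ConnectionError, try again up to `retry` more times."""
    for attempt in range(retry + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt < retry:
                sleep(interval)  # wait before the next attempt
    return None  # every attempt failed
```

With `retry=2` the callable runs at most three times in total: the first attempt plus two retries.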
### Switch mode
When switching between s and d modes, the cookies and the URL you are visiting are synchronized automatically.
```python
page.post(url, data, retry, interval, **kwargs) # Visit the page with a POST request; the number of retries and the interval can be specified
# d mode unique:
page.wait_ele(loc_or_ele, mode, timeout) # Wait for an element to be deleted from the DOM, displayed, or hidden
page.run_script(js, *args) # Run js statement
page.create_tab(url) # Create and locate a tab page, which is at the end
page.to_tab(num_or_handle) # Jump to tab page
page.close_current_tab() # Close the current tab page
page.close_other_tabs(num) # Close other tabs
page.to_iframe(iframe) # Switch into an iframe
page.screenshot(path) # Page screenshot
page.scroll_to_see(element) # Scroll until an element is visible
page.scroll_to(mode, pixel) # Scroll the page as indicated by the parameters; mode is one of 'top', 'bottom', 'rightmost', 'leftmost', 'up', 'down', 'left', 'right'
page.refresh() # Refresh the current page
page.back() # Browser back
page.set_window_size(x, y) # Set the browser window size, maximize by default
page.check_page() # Check whether the page meets expectations
page.chrome_downloading() # Get the list of files that chrome is downloading
page.process_alert(mode, text) # Process the prompt box
element.texts() # Returns the text of all direct child nodes in the element, including elements and text nodes, you can specify to return only text nodes
element.attrs # Return a dictionary of all attributes of the element
element.attr(attr) # Return the value of the specified attribute of the element
element.css_path # Return the absolute css path of the element
element.xpath # Return the absolute xpath path of the element
element.parent # Return the parent element
element.next # Return the next sibling element of the element
element.prev # Return the previous sibling element of the element
element.parents(num) # Return the num-th level parent element
element.nexts(num, mode) # Return the following elements or nodes
element.prevs(num, mode) # Return the preceding elements or nodes
element.ele(loc_or_str, timeout) # Return the first sub-element, attribute or node text of the current element that meets the conditions
element.eles(loc_or_str, timeout) # Return all sub-elements, attributes or node texts of the current element that meet the conditions
# Driver mode unique:
element.before # Get pseudo element before content
element.after # Get pseudo element after content
element.is_valid # Used to determine whether the element is still in dom
element.size # Get element size
element.location # Get element location
element.shadow_root # Get the ShadowRoot element under the element
element.get_style_property(style, pseudo_ele) # Get element style attribute value, can get pseudo element
element.is_selected() # Returns whether the element is selected
element.is_enabled() # Returns whether the element is available
element.is_displayed() # Returns whether the element is visible
```
## Element operation
Element operation is unique to d mode. Calling the following method will automatically switch to d mode.
```python
element.click(by_js) # Click the element, you can choose whether to click with js
element.input(value) # input text
element.run_script(js) # Run JavaScript script on the element
element.submit() # Submit
element.clear() # Clear the element
element.screenshot(path, filename) # Take a screenshot of the element
element.select(text) # Select an item in a drop-down list based on its text
element.set_attr(attr, value) # Set element attribute value
element.drag(x, y, speed, shake) # Drag the relative distance of the element, you can set the speed and whether to shake randomly
element.drag_to(ele_or_loc, speed, shake) # Drag the element to another element or a certain coordinate, you can set the speed and whether to shake randomly
element.hover() # Hover the mouse over the element
```
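The drag methods above take a speed and an optional random shake. One way to picture what such a drag could look like is a list of intermediate offsets with a little jitter, as in the sketch below. This is an assumption-laden illustration, not the library's algorithm: `drag_path`, the `steps` parameter (standing in for speed), and the jitter range of 3 pixels are all hypothetical.

```python
import random
from typing import List, Optional, Tuple


def drag_path(x: int, y: int, steps: int = 10, shake: bool = True,
              rng: Optional[random.Random] = None) -> List[Tuple[int, int]]:
    """Return offsets from the start point, ending exactly at (x, y)."""
    rng = rng or random.Random()
    steps = max(1, steps)
    points = []
    for i in range(1, steps + 1):
        px = round(x * i / steps)
        py = round(y * i / steps)
        if shake and i < steps:  # never jitter the final point
            px += rng.randint(-3, 3)
            py += rng.randint(-3, 3)
        points.append((px, py))
    return points
```

Fewer steps correspond to a faster drag; with `shake=False` the path is a straight line.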
## Interoperating with selenium code
DrissionPage code can be mixed seamlessly with selenium code: it can either take over an existing selenium WebDriver object, or hand its own WebDriver to selenium code. This makes migrating existing projects very convenient.
### selenium to DrissionPage
```python
from selenium import webdriver
from DrissionPage import Drission, MixPage

driver = webdriver.Chrome()
driver.get('https://www.baidu.com')
page = MixPage(Drission(driver)) # Pass the driver to Drission, then create a MixPage object
print(page.title) # Prints the Baidu home page title: 百度一下，你就知道
```
### DrissionPage to selenium
```python
from DrissionPage import MixPage

page = MixPage()
page.get('https://www.baidu.com')
driver = page.driver # Get the WebDriver object from the MixPage object
print(driver.title) # Prints the Baidu home page title: 百度一下，你就知道
```
## Download file
Selenium lacks effective management of browser downloads: it is hard to detect download status, rename files, or handle failures.
Downloading with requests handles all of this better, but the code is more cumbersome.
DrissionPage therefore wraps the download method and combines the advantages of both. You can obtain login information from selenium and then download with requests.
This makes up for selenium's shortcomings and keeps downloads simple and efficient.
### Features
- Specify download path
- Rename the file without specifying the extension; the program adds it automatically
- When there is a file with the same name, you can choose to rename, overwrite, skip, etc.
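The rename and same-name features can be sketched with the standard library alone. The code below only illustrates the behaviour listed above and is not DrissionPage's code: the helper names and the `_1`, `_2` numbering scheme for renamed files are assumptions.

```python
from pathlib import Path
from typing import Optional


def with_original_extension(original: str, rename: str) -> str:
    """Keep the downloaded file's extension when `rename` does not supply one."""
    if Path(rename).suffix:
        return rename
    return rename + Path(original).suffix


def resolve_target(folder: Path, name: str, file_exists: str = "rename") -> Optional[Path]:
    """Pick the path to write to, or None when the download should be skipped."""
    if file_exists not in ("rename", "overwrite", "skip"):
        raise ValueError(f"unknown file_exists mode: {file_exists!r}")
    target = folder / name
    if not target.exists() or file_exists == "overwrite":
        return target
    if file_exists == "skip":
        return None
    n = 1
    while True:  # 'rename': append a counter until the name is free
        candidate = folder / f"{target.stem}_{n}{target.suffix}"
        if not candidate.exists():
            return candidate
        n += 1
```

For example, renaming `report.pdf` to `q1` yields `q1.pdf`, and a second download of `a.txt` in rename mode would be written to `a_1.txt` under this scheme.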
Because chrome and headers have many configuration options, an ini file is provided specifically to store common settings. You can use the OptionsManager object to get and save the configuration, and the DriverOptions object to modify the chrome configuration. You can also save multiple ini files and use them for different projects.
Tips: It is recommended to save commonly used configuration files to another path, so that the configuration is not reset when the library is upgraded.
**Note**: If you do not pass in a path when saving, the configuration is saved to the ini file in the module directory, even if it was read from a non-default ini file.
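To make the per-project ini idea concrete, here is a minimal sketch using Python's standard `configparser` rather than the library's OptionsManager. The section name `paths` and the keys `chromedriver_path` and `tmp_path` are hypothetical placeholders, not the library's documented ini layout.

```python
import configparser
from pathlib import Path


def load_config(ini_path: Path) -> configparser.ConfigParser:
    """Read one project's ini file into a parser object."""
    parser = configparser.ConfigParser()
    parser.read(ini_path, encoding="utf-8")
    return parser


def save_config(parser: configparser.ConfigParser, ini_path: Path) -> None:
    """Write the configuration to an explicit path, never an implicit default."""
    with open(ini_path, "w", encoding="utf-8") as f:
        parser.write(f)
```

Passing the path explicitly on every save mirrors the note above: it avoids overwriting a shared default file by accident.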
set_user_agent('Mozilla/5.0 (Macintosh; Int......') # set user agent
set_proxy('127.0.0.1:8888') # set proxy
set_paths(paths) # See [Initialization] section
set_argument(arg, value) # Set an argument. If the argument takes no value (such as 'zh_CN.UTF-8'), pass a bool to switch it on or off; otherwise pass a str. When the value is '' or False, the argument is deleted
- ini_path: str - ini file path, the default is the ini file under the DrissionPage folder
- proxy: dict - proxy settings
### session
Return the Session object, which is automatically initialized according to the configuration information.
Returns: Session - the managed Session object
### driver
Return the WebDriver object, which is automatically initialized according to the configuration information.
Returns: WebDriver - the managed WebDriver object
### driver_options
Return or set the driver configuration.
Returns: dict
### session_options
Return the session configuration.
Returns: dict
### session_options()
Set the session configuration.
Returns: None
### proxy
Return the proxy configuration.
Returns: dict
### cookies_to_session()
Copy the cookies of the driver object to the session object.
Parameter Description:
- copy_user_agent: bool - whether to copy user_agent to session
- driver: WebDriver - the WebDriver object to copy cookies from
- session: Session - the Session object that receives the cookies
Returns: None
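As a rough picture of what copying cookies from a driver to a session involves: selenium's `driver.get_cookies()` returns a list of dicts with `name`, `value` and optional fields such as `domain` and `path`. The sketch below reshapes such a list into arguments that could be fed to `session.cookies.set(name, value, **extra)` in requests. It is an illustration under those assumptions, not the library's implementation, and `selenium_cookies_to_session_args` is a hypothetical helper name.

```python
from typing import Dict, Iterable, List, Tuple


def selenium_cookies_to_session_args(
        cookies: Iterable[dict]) -> List[Tuple[str, str, Dict[str, str]]]:
    """Turn selenium-style cookie dicts into (name, value, extra) triples."""
    result = []
    for cookie in cookies:
        # Keep only the fields that identify where the cookie applies.
        extra = {key: cookie[key] for key in ("domain", "path") if key in cookie}
        result.append((cookie["name"], cookie["value"], extra))
    return result
```

Each triple can then be applied to a requests session in a loop, which keeps the conversion step testable without a browser.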
### cookies_to_driver()
Copy cookies from session to driver.
Parameter Description:
- url: str - the domain of cookies
- driver: WebDriver - the WebDriver object that receives the cookies
- session: Session - the Session object to copy cookies from
Returns: None
### user_agent_to_session()
Copy the user agent from the driver to the session.
Parameter Description:
- driver: WebDriver - the WebDriver object to copy the user agent from
- session: Session - the Session object that receives the user agent
Returns: None
### close_driver()
Close the browser and set the driver to None.
Returns: None
### close_session()
Close the session and set it to None.
Returns: None
### close()
Close the driver and session.
Returns: None
## MixPage Class
### class MixPage()
MixPage encapsulates the common functions of page operation and can seamlessly switch between driver and session modes. Cookies are automatically synchronized when switching.
The function of obtaining information is shared by the two modes, and the function of operating page elements is only available in mode d. Calling a function unique to a certain mode will automatically switch to that mode.
It inherits from the DriverPage and SessionPage classes, which implement these functions; MixPage itself acts as the dispatcher.
Parameter Description:
- drission: Drission - Drission object; if not passed in, one is created. Passing in 's' or 'd' quickly configures the corresponding mode
- timeout: float - timeout, driver mode is the time to find elements, session mode is the connection waiting time
### url
Returns the URL currently visited by the MixPage object.
Returns: str
### mode
Returns the current mode ('s' or 'd').
Returns: str
### drission
Returns the Drission object currently in use.
Returns: Drission
### driver
Return the driver object, creating it if it does not exist. Calling this switches to driver mode.
Returns: WebDriver
### session
Return the session object, creating it if it does not exist.
Returns: Session
### response
Return the Response object obtained in s mode, and switch to s mode when called.
Returns: Response
### cookies
Return cookies, obtained from the current mode.
Returns: [dict, list]
### html
Return the html text of the page.
Returns: str
### title
Return the page title.
Returns: str
### url_available
Returns the validity of the current url.
Returns: bool
### change_mode()
Switch mode, 'd' or 's'. When switching, the cookies of the current mode are copied to the target mode.
Parameter Description:
- mode: str - Specify the target mode, 'd' or 's'.
- go: bool - whether to jump to the current url after switching mode
Returns: None
### ele()
Return the elements on the page that meet the conditions; the first one is returned by default.
If the query parameter is a string, the prefixes '@attribute name:', 'tag:', 'text:', 'css:', and 'xpath:' are available. Without a control prefix, text mode is used by default.
If it is a loc tuple, it is used for the query directly.
Parameter Description:
- loc_or_str: [Tuple[str, str], str, DriverElement, SessionElement, WebElement] - The positioning information of the element, which can be an element object, a loc tuple, or a query string
- mode: str - 'single' or 'all', to find one element or all of them
- timeout: float - Find the timeout of the element, valid in driver mode
Example:
- When an element object is received: return the element object itself
- Find with loc tuple:
- ele.ele((By.CLASS_NAME,'ele_class')) - returns the first child element whose class is ele_class
- Find with query string:
Supported forms: attribute, tag name with attribute, text, xpath, css selector.
Here @ marks an attribute, = means exact match, and : means fuzzy match. A string without a control prefix is searched as text by default.
- page.ele('@class:ele_class') - returns the first element whose class contains ele_class
- page.ele('@name=ele_name') - returns the first element whose name is equal to ele_name
- page.ele('@placeholder') - returns the first element with placeholder attribute
- page.ele('tag:p') - return the first p element
- page.ele('tag:div@class:ele_class') - returns the first div element whose class contains ele_class
- page.ele('tag:div@class=ele_class') - returns the first div element whose class is equal to ele_class
- page.ele('tag:div@text():some_text') - returns the first div element whose text contains some_text
- page.ele('tag:div@text()=some_text') - returns the first div element whose text is equal to some_text
- page.ele('text:some_text') - returns the first element whose text contains some_text
- page.ele('some_text') - returns the first element whose text contains some_text (equivalent to the previous line)
- page.ele('text=some_text') - returns the first element whose text is equal to some_text
- page.ele('xpath://div[@class="ele_class"]') - return the first element that matches xpath
- page.ele('css:div.ele_class') - returns the first element that matches the css selector
Returns: [DriverElement, SessionElement, str] - element object or attribute, text node text
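One way to understand the query string syntax is to translate each form into an ordinary selenium-style locator. The sketch below does that for the examples listed above. It is a simplified illustration, not how the library parses queries: `query_to_locator` and its helpers are hypothetical, and values containing double quotes are not handled.

```python
from typing import Tuple


def _attr(name: str) -> str:
    """'text()' stays an XPath node test; anything else becomes @name."""
    return "text()" if name == "text()" else f"@{name}"


def _condition(cond: str) -> str:
    """Build an XPath predicate from 'name=value', 'name:value' or 'name'."""
    eq, colon = cond.find("="), cond.find(":")
    if eq != -1 and (colon == -1 or eq < colon):  # '=' means exact match
        return f'[{_attr(cond[:eq])}="{cond[eq + 1:]}"]'
    if colon != -1:  # ':' means fuzzy (contains) match
        return f'[contains({_attr(cond[:colon])},"{cond[colon + 1:]}")]'
    return f"[{_attr(cond)}]"  # attribute merely has to exist


def query_to_locator(query: str) -> Tuple[str, str]:
    """Translate a query string into a ('xpath' or 'css selector', value) pair."""
    if query.startswith("css:"):
        return "css selector", query[4:]
    if query.startswith("xpath:"):
        return "xpath", query[6:]
    if query.startswith("tag:"):
        tag, sep, cond = query[4:].partition("@")
        return "xpath", f"//{tag}{_condition(cond) if sep else ''}"
    if query.startswith("@"):
        return "xpath", f"//*{_condition(query[1:])}"
    if query.startswith("text="):
        return "xpath", f'//*[text()="{query[5:]}"]'
    if query.startswith("text:"):
        query = query[5:]
    return "xpath", f'//*[contains(text(),"{query}")]'  # default: text search
```

The returned pair has the same shape as a loc tuple, which is why a query string and a loc tuple can be used interchangeably as the first argument.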
### eles()
Get the list of elements that meet the conditions according to the query parameters. The query parameter usage method is the same as the ele method.
- timeout: float - Find the timeout of the element, valid in driver mode
Returns: [List[DriverElement or str], List[SessionElement or str]] - a list of element objects or attributes and text node text
### cookies_to_session()
Copy cookies from the WebDriver object to the Session object.
Parameter Description:
- copy_user_agent: bool - whether to copy user agent at the same time
Returns: None
### cookies_to_driver()
Copy cookies from the Session object to the WebDriver object.
Parameter Description:
- url: str - the domain or url of cookies
Returns: None
### get()
Jump to a url. Cookies are synchronized before the jump, and the method returns whether the target url is available after the jump.
Parameter Description:
- url: str - target url
- go_anyway: bool - Whether to force a jump. If the target url is the same as the current url, it will not redirect by default.
- show_errmsg: bool - whether to display and throw an exception
- retry: int - the number of retries when a connection error occurs
- interval: float - Retry interval (seconds)
- **kwargs - connection parameters for requests
Returns: [bool, None] - whether the url is available
### post()
Jump using a POST request. Calling this automatically switches to session mode.
Parameter Description:
- url: str - target url
- data: dict - submitted data
- go_anyway: bool - Whether to force a jump. If the target url is the same as the current url, it will not redirect by default.
- show_errmsg: bool - whether to display and throw an exception
- retry: int - the number of retries when a connection error occurs
- interval: float - Retry interval (seconds)
- **kwargs - connection parameters for requests
Returns: [bool, None] - whether the url is available
### download()
Download a file and return whether the download succeeded together with a download information string. This method automatically avoids name clashes with existing files in the target path.
Parameter Description:
- file_url: str - file url
- goal_path: str - storage path, the default is the temporary folder specified in the ini file
- rename: str - rename the file without changing the extension
- file_exists: str - If a file with the same name exists, choose 'rename', 'overwrite', or 'skip' to handle it
- post_data: dict - data submitted in post mode
- show_msg: bool - whether to show download information
- show_errmsg: bool - whether to display and throw an exception
- **kwargs - connection parameters for requests
Returns: Tuple[bool, str] - a tuple of whether the download was successful (bool) and status information (the information is the file path when successful)
The following methods and properties only take effect in driver mode; calling them automatically switches to driver mode.
***
### tabs_count
Returns the number of tab pages.
Returns: int
### tab_handles
Returns the handle list of all tabs.
Returns: list
### current_tab_num
Returns the serial number of the current tab page.
Returns: int
### current_tab_handle
Returns the handle of the current tab page.
Returns: str
### wait_ele()
Wait for an element to be deleted from the DOM, displayed, or hidden.
Parameter Description:
- loc_or_ele: [str, tuple, DriverElement, WebElement] - Element search method, same as ele()
### check_page()
In d mode, check whether the web page meets expectations. The response status is checked by default; the method can be overridden to implement targeted checks.
Parameter Description:
- by_requests: bool - Force the use of the built-in response for checking
Returns: [bool, None] - a bool indicates availability, None means unknown
### run_script()
Execute JavaScript code.
Parameter Description:
- script: str - JavaScript code text
- *args - incoming parameters
Returns: Any
### create_tab()
Create and locate a tab page, which is at the end.
Parameter Description:
- url: str - the URL to jump to the new tab page
Returns: None
### close_current_tab()
Close the current tab.
Returns: None
### close_other_tabs()
Close tab pages other than the incoming tab page, and keep the current page by default.
Parameter Description:
- num_or_handle: [int, str] - The serial number or handle of the tab to keep; the first serial number is 0 and the last is -1
Returns: None
### to_tab()
Jump to the tab page.
Parameter Description:
- num_or_handle: [int, str] - tab page serial number or handle string; the first serial number is 0 and the last is -1
Returns: None
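The `num_or_handle` convention used by `to_tab()` and `close_other_tabs()` maps naturally onto Python list indexing. The sketch below shows the idea on a plain list of handle strings. The helper names are hypothetical and this is not the library's code.

```python
from typing import List, Union


def resolve_tab(handles: List[str], num_or_handle: Union[int, str]) -> str:
    """Return the handle for a serial number (0 = first, -1 = last) or a handle string."""
    if isinstance(num_or_handle, int):
        return handles[num_or_handle]  # negative numbers count from the end
    if num_or_handle in handles:
        return num_or_handle
    raise ValueError(f"unknown tab handle: {num_or_handle!r}")


def handles_to_close(handles: List[str], keep: Union[int, str]) -> List[str]:
    """List every handle except the one that should stay open."""
    keep_handle = resolve_tab(handles, keep)
    return [handle for handle in handles if handle != keep_handle]
```

Accepting either an index or a handle keeps call sites short when the tab order is known, while still allowing a stable reference when tabs are opened and closed.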
### to_iframe()
Jump into an iframe; by default, jump to the top level. Compatible with selenium's native parameters.
Parameter Description:
- loc_or_ele: [int, str, tuple, WebElement, DriverElement] - Condition for finding the iframe element. Accepts the iframe serial number (starting at 0), id or name, a query string, a loc parameter, a WebElement object, or a DriverElement object. Pass in 'main' to jump to the top level, or 'parent' to jump up one level
Example:
- to_iframe('tag:iframe') - locate the iframe by the query string passed in
- to_iframe('iframe_id') - locate by the id attribute of the iframe
- to_iframe('iframe_name') - locate by the name attribute of the iframe
- to_iframe(iframe_element) - locate by passing in the element object
- to_iframe(0) - locate by the serial number of the iframe
- to_iframe('main') - jump to the top level
- to_iframe('parent') - jump up one level
Returns: None
### scroll_to_see()
Scroll until the element is visible.
Parameter Description:
- loc_or_ele: [str, tuple, WebElement, DriverElement] - The condition for finding the element, the same as for the ele() method.
Returns: None
### scroll_to()
Scroll the page and decide how to scroll according to the parameters.
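Scrolling by mode can be pictured as choosing a JavaScript statement from a table. The sketch below builds such a statement for the eight directions named earlier; the resulting string could then be handed to a script runner such as `run_script()`. The exact JavaScript and the defaults for `mode` and `pixel` are assumptions, and `scroll_script` is a hypothetical helper, not the library's implementation.

```python
def scroll_script(mode: str = "bottom", pixel: int = 300) -> str:
    """Return a JavaScript statement that scrolls the window as requested."""
    scripts = {
        # Jump to an edge of the page, keeping the other axis unchanged.
        "top": "window.scrollTo(document.documentElement.scrollLeft, 0);",
        "bottom": "window.scrollTo(document.documentElement.scrollLeft, document.body.scrollHeight);",
        "leftmost": "window.scrollTo(0, document.documentElement.scrollTop);",
        "rightmost": "window.scrollTo(document.body.scrollWidth, document.documentElement.scrollTop);",
        # Move by a relative number of pixels.
        "up": f"window.scrollBy(0, {-pixel});",
        "down": f"window.scrollBy(0, {pixel});",
        "left": f"window.scrollBy({-pixel}, 0);",
        "right": f"window.scrollBy({pixel}, 0);",
    }
    if mode not in scripts:
        raise ValueError(f"unknown scroll mode: {mode!r}")
    return scripts[mode]
```

Only the four relative directions use `pixel`; the four edge modes ignore it.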
Returns the sub-elements, attributes or node texts of the current element that meet the conditions.
If the query parameter is a string, the prefixes '@attribute name:', 'tag:', 'text:', 'css:', and 'xpath:' are available. Without a control prefix, text mode is used by default.
If it is a loc tuple, it is used for the query directly.
- ele_or_loc: [tuple, WebElement, DrissionElement] - Another element or a position relative to the current one; the coordinates are those of the element's midpoint.
If the query parameter is a string, the prefixes '@attribute name:', 'tag:', 'text:', 'css:', and 'xpath:' are available. Without a control prefix, text mode is used by default.
If it is a loc tuple, it is used for the query directly.
The Chrome browser configuration class. It inherits from the Options class of selenium.webdriver.chrome.options and adds methods for deleting configuration and saving to a file.
Remove all plug-ins. Because plug-ins are stored as whole files, removing a single one is difficult; if you need to change them, remove all and set them again.
Chrome's configuration is hard to remember, so commonly used settings are wrapped as simple methods; calling them modifies the corresponding content of the ini file.
Set an argument. If the argument takes no value (such as 'zh_CN.UTF-8'), pass a bool as value to switch it on or off; otherwise pass a str. When value is '' or False, the argument is deleted.
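The switch, set and delete semantics described above can be modelled on a plain list of Chrome-style arguments. This is a sketch under assumptions, not the library's method: `apply_argument` is a hypothetical helper, and the `name=value` spelling follows the usual Chrome command-line convention.

```python
from typing import List, Union


def apply_argument(arguments: List[str], arg: str, value: Union[bool, str]) -> List[str]:
    """Return a new argument list with `arg` switched on, set to a value, or removed."""
    # Drop any existing form of the argument: bare 'arg' or 'arg=...'.
    kept = [a for a in arguments if a != arg and not a.startswith(arg + "=")]
    if value is True:
        kept.append(arg)  # a switch with no value
    elif isinstance(value, str) and value:
        kept.append(f"{arg}={value}")  # an argument carrying a value
    # value of False or '' leaves the argument deleted
    return kept
```

Removing the old entry before adding the new one means repeated calls never leave duplicates behind.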