For newcomers to web crawling, sites that require login are hard going: you have to analyze network packets and JS source code, construct complex requests, and often deal with CAPTCHAs, JS obfuscation, and signature parameters. Some data is also computed by JavaScript, so if you only fetch the raw source you must reproduce that computation yourself. The experience is poor and development is slow.
Selenium avoids most of these problems, but it is slow. This library therefore combines selenium and requests into one tool with a user-friendly interface, to improve both development and runtime efficiency.
Beyond merging the two, the library wraps commonly used functions around the web page as the basic unit, which simplifies selenium operations and statements. In web automation this lets you think less about details and focus on the functionality you are implementing.
The design philosophy is to keep everything simple: provide straightforward, direct ways of doing things and stay friendly to novices.
Python 3.6 and above are supported. Driver mode currently supports only Chrome.
To use driver mode, you must download Chrome and the **matching version** of chromedriver. [chromedriver download](https://chromedriver.chromium.org/downloads)
So far the library has been tested only on Windows.
# Instructions
***
## Import
```python
from DrissionPage import *
```
## Initialization
Before using selenium, you must configure the path of chrome.exe and chromedriver.exe and ensure that their versions match.
If you only use session mode, you can skip this section.
There are three ways to configure the path:
- Write both paths to the system environment variables.
- Pass the paths in manually when using the library.
- Write the paths to this library's ini file (recommended).
If you choose the third method, run these lines of code once before using the library for the first time to record both paths in the ini file.
A Drission object manages the driver and session objects. When several pages work together, it is the Drission object that is passed around, so that multiple page classes can control the same browser or Session object.
It can be created by reading the configuration directly from the ini file, or the configuration can be passed in during initialization.
The MixPage page object encapsulates commonly used web page operations and implements switching between driver and session mode.
MixPage must receive a Drission object and use its driver or session. If none is passed in, MixPage creates a Drission itself, using the configuration from the default ini file.
Tip: When several page objects work together, remember to create the Drission object manually and pass it to each page object. Otherwise each page object creates its own Drission object, and information cannot be shared between them.
```python
element.text  # Text of the element with HTML tags removed
element.tag  # Tag name of the element
element.attrs  # Dictionary of all attributes of the element
element.attr('class')  # Value of the element's class attribute
element.is_valid  # Driver mode only: whether the element is still available

# Operating on elements
element.click()  # Click the element
element.input(text)  # Enter text
element.run_script(js)  # Run JavaScript on the element
element.submit()  # Submit the form
element.clear()  # Clear the element
element.is_selected()  # Whether the element is selected
element.is_enabled()  # Whether the element is enabled
element.is_displayed()  # Whether the element is visible
element.is_valid()  # Whether the element is still valid, e.g. after a page jump has invalidated it
element.select(text)  # Select an option in a drop-down list
element.set_attr(attr, value)  # Set an attribute of the element
element.size  # Size of the element
element.location  # Position of the element
```
## Save configuration
Because Chrome and the request headers have many settings, an ini file is provided to store the commonly used ones. Use the OptionsManager object to read and save the configuration, and the DriverOptions object to modify the Chrome configuration. You can also keep several ini files and load a different one for each project.
Tip: It is recommended to save your regular configuration files to a separate path so that they are not reset when the library is upgraded.
### ini file
By default the ini file has three sections: paths, chrome_options, and session_options. The initial contents are as follows.
```ini
[paths]
; chromedriver.exe path
chromedriver_path =
; Temporary folder path, used to save screenshots, download files, etc.
global_tmp_path =
[chrome_options]
; The opened browser address and port, such as 127.0.0.1:9222
debugger_address =
; chrome.exe path
binary_location =
; Configuration information
arguments = [
; Hide browser window
'--headless',
; Mute
'--mute-audio',
; No sandbox
'--no-sandbox',
; Google documentation mentions the need to add this attribute to avoid bugs
```
The OptionsManager object is used to read, set, and save configurations.
```python
get_value(section, item) -> str # Get the value of a configuration
get_option(section) -> dict # Return all configuration properties in dictionary format
set_item(section, item, value) # Set configuration properties
save() # Save configuration to default ini file
save('D:\\settings.ini') # Save to other path
```
**Note**: If you do not pass in a path when saving, the configuration is written to the ini file in the module directory, even if it was not read from the default ini file.
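To make the read, set, and save cycle concrete, here is a minimal sketch that uses only Python's standard `configparser` module. It is not the library's OptionsManager: the helper names `load_ini` and `save_ini`, the sample chromedriver path, and the output file name are all assumptions made for illustration. Only the section and key names come from the ini file shown above.

```python
import configparser
import tempfile
from pathlib import Path

# A cut-down ini file using section and key names from the default file above.
SAMPLE_INI = """\
[paths]
chromedriver_path =
global_tmp_path =

[chrome_options]
debugger_address =
binary_location =
"""


def load_ini(text):
    """Parse ini text; interpolation is disabled so '%' in values is kept literally."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


def save_ini(parser, path):
    """Write the configuration to an explicit path instead of a default location."""
    with open(path, 'w', encoding='utf-8') as f:
        parser.write(f)


parser = load_ini(SAMPLE_INI)

# Read a whole section as a dictionary.
options = dict(parser['chrome_options'])

# Set one value (hypothetical path) and save to a separate file.
parser.set('paths', 'chromedriver_path', 'D:\\tools\\chromedriver.exe')
target = Path(tempfile.gettempdir()) / 'settings_example.ini'
save_ini(parser, target)
```

Saving to an explicit path, as the sketch does, is the pattern that keeps a project's settings out of the module directory.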
### DriverOptions object
The DriverOptions object inherits from the Options object of selenium.webdriver.chrome.options, and adds the following methods to it:
```python
remove_argument(value) # Delete an argument value
remove_experimental_option(key)  # Delete an experimental_option setting
set_argument(arg, on_off)  # Set an argument. If it takes no value (e.g. 'zh_CN.utf-8'), pass a bool to switch it on or off; passing "" or False deletes the entry
```
Returns the HTMLSession object, which is created automatically when called.
### driver
Returns the WebDriver object, which is created automatically when first accessed and initialized from the passed-in configuration or the ini file.
Copies cookies from the session to the driver. By default self.session is copied to self.driver, but a different driver and session can also be passed in. A URL or domain name must be specified.
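The reason a URL is required can be seen in a small sketch. Selenium only accepts cookies for the domain of the page that is currently loaded, so the target page has to be opened before cookies are added. The function below is an illustration under assumptions, not the library's implementation: `copy_cookies_to_driver` is a hypothetical name, and it expects a requests-style session (whose `cookies` yield objects with `name`, `value`, and `path`) and a selenium-style driver (with `get` and `add_cookie`).

```python
def copy_cookies_to_driver(session, driver, url):
    """Copy every cookie held by a requests-style session into a selenium-style driver."""
    # Selenium rejects cookies for a domain other than the one currently loaded,
    # so the target URL is opened first.
    driver.get(url)
    for cookie in session.cookies:
        driver.add_cookie({
            'name': cookie.name,
            'value': cookie.value,
            'path': cookie.path,
        })
```

With real objects this would be called as `copy_cookies_to_driver(session, driver, 'https://example.com')`, where the URL is a placeholder.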
Copies the user agent from the driver to the session. By default self.driver is copied to self.session, but a different driver and session can also be passed in.
MixPage encapsulates common page operations and can switch seamlessly between driver and session modes. Cookies are synchronized automatically when switching.
Functions that retrieve information are available in both modes, while functions that operate on page elements are available only in driver mode. Calling a function that exists in only one mode automatically switches to that mode.
MixPage inherits from the DriverPage and SessionPage classes, which implement these functions. MixPage itself acts as a dispatcher.
Gets elements according to the query parameter and returns a single element or a list of elements.
If the query parameter is a string, it can use one of the prefixes '@attribute name:', 'tag:', 'text:', 'css:', or 'xpath:'. When no prefix is given, the search defaults to text mode.
If it is a loc tuple, it is used directly as the locator.
Parameter Description:
- loc_or_str - Query condition; if an element object is passed in, it is returned as-is
- mode - Whether to find one element or several: pass 'single' or 'all'
- timeout - Timeout for finding the element, effective in driver mode
- show_errmsg - Whether to raise and display the exception when one occurs
Examples:
- page.ele('@id:ele_id') - Find elements by attributes
- page.ele('tag:div') - Find elements by tag name
- page.ele('text:some text') - Find elements by text
- page.ele('some text') - Find elements by text
- page.ele('css:>div') - Find elements by css selector
- page.ele('xpath://div') - Find elements by xpath
- page.ele((By.ID, 'ele_id')) - Find elements by loc
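The prefix rules above can be pictured as a small parser. The sketch below is an illustration only, not the library's code: `parse_query` is a hypothetical helper, the generated XPath expressions (exact attribute match, `contains` for text) are assumptions about how such prefixes might be translated, and text containing double quotes is not escaped. The strategy strings `'xpath'` and `'css selector'` are the values selenium uses for `By.XPATH` and `By.CSS_SELECTOR`.

```python
def parse_query(query):
    """Turn a query string (or a loc tuple) into a (strategy, expression) pair."""
    # A loc tuple such as (By.ID, 'ele_id') is already a locator.
    if isinstance(query, tuple):
        return query

    # '@name:value' searches by attribute.
    if query.startswith('@'):
        name, _, value = query[1:].partition(':')
        return ('xpath', f'//*[@{name}="{value}"]')

    prefix, sep, rest = query.partition(':')
    if sep and prefix == 'tag':
        return ('xpath', f'//{rest}')
    if sep and prefix == 'css':
        return ('css selector', rest)
    if sep and prefix == 'xpath':
        return ('xpath', rest)

    # 'text:...' and strings without a known prefix both fall back to text search.
    text = rest if sep and prefix == 'text' else query
    return ('xpath', f'//*[contains(text(), "{text}")]')
```

Falling back to text search for unknown prefixes means a string such as 'Note: hello' is treated as plain text rather than rejected.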
Downloads a file and returns whether the download succeeded together with a download information string. This method automatically renames the file when needed to avoid clashing with an existing file in the target path.
Parameter Description:
- file_url - URL of the file
- goal_path - Storage path; defaults to the temporary folder specified in the ini file
- rename - New name for the file; the file is not renamed by default
- loc_or_ele - Condition for finding the iframe element. Accepts an iframe index (starting at 0), an id or name, a query string, a loc parameter, a WebElement object, or a DriverElement object. Pass 'main' to jump to the top level, or 'parent' to jump to the parent level.
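The automatic renaming that the download method performs to avoid name clashes can be sketched with the standard library. This is an illustration under assumptions: `available_path` is a hypothetical helper, and the `_1`, `_2` suffix scheme is one common choice, not necessarily the one the library uses.

```python
from pathlib import Path


def available_path(goal_path, file_name):
    """Return a path in goal_path that does not clash with an existing file."""
    folder = Path(goal_path)
    candidate = folder / file_name
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    # Append an increasing counter until the name is free.
    while candidate.exists():
        candidate = folder / f'{stem}_{counter}{suffix}'
        counter += 1
    return candidate
```

Checking before writing keeps an earlier download from being overwritten silently.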
If the query parameter is a string, it can use one of the prefixes '@attribute name:', 'tag:', 'text:', 'css:', or 'xpath:'. When no prefix is given, the search defaults to text mode.
If it is a loc tuple, it is used directly as the locator.
Parameter Description:
- loc_or_str - Query condition
- mode - Whether to find one element or several: pass 'single' or 'all'
- show_errmsg - Whether to raise and display the exception when one occurs
- timeout - Timeout for finding the element
Examples:
- element.ele('@id:ele_id') - Find elements by attributes
- element.ele('tag:div') - Find elements by tag name
- element.ele('text:some text') - Find elements by text
- element.ele('some text') - Find elements by text
- element.ele('css:>div') - Find elements by css selector
- element.ele('xpath://div') - Find elements by xpath
- element.ele((By.ID, 'ele_id')) - Find elements by loc
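The difference between the 'single' and 'all' modes, and the idea of searching inside an element rather than the whole page, can be shown with the standard library's `xml.etree.ElementTree`. This is only an analogy: `find` is a hypothetical helper, the markup is made up, and ElementTree supports just a subset of XPath. Returning `None` when nothing matches in 'single' mode is also an assumption of this sketch.

```python
import xml.etree.ElementTree as ET


def find(root, path, mode='single'):
    """Search below root; return one element (or None) or a list of elements."""
    if mode not in ('single', 'all'):
        raise ValueError("mode must be 'single' or 'all'")
    matches = root.findall(path)
    if mode == 'all':
        return matches
    return matches[0] if matches else None


doc = ET.fromstring(
    '<body><div id="a"><p>one</p><p>two</p></div><div id="b"/></body>'
)

# 'single' returns the first match only.
first_div = find(doc, './/div')

# Searching from an element limits the search to its descendants.
paragraphs = find(first_div, './/p', mode='all')
```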
If the query parameter is a string, it can use one of the prefixes '@attribute name:', 'tag:', 'text:', 'css:', or 'xpath:'. When no prefix is given, the search defaults to text mode.
If it is a loc tuple, it is used directly as the locator.
Parameter Description:
- loc_or_str - Query condition
- mode - Whether to find one element or several: pass 'single' or 'all'
- show_errmsg - Whether to raise and display the exception when one occurs
Examples:
- element.ele('@id:ele_id') - Find elements by attributes
- element.ele('tag:div') - Find elements by tag name
- element.ele('text:some text') - Find elements by text
- element.ele('some text') - Find elements by text
- element.ele('css:>div') - Find elements by css selector
- element.ele('xpath://div') - Find elements by xpath
- element.ele((By.ID, 'ele_id')) - Find elements by loc
- path - The path of the ini file, which is saved to the module folder by default
## DriverOptions class
class DriverOptions(read_file=True)
The Chrome browser configuration class. It inherits from the Options class of selenium.webdriver.chrome.options and adds methods for removing settings and saving them to a file.
Parameter Description:
- read_file - Boolean, specifies whether to read configuration information from the ini file when creating
### remove_argument
remove_argument(value: str) -> None
Removes an argument from the configuration.
Parameter Description:
- value - The attribute value to be removed
### remove_experimental_option
remove_experimental_option(key: str) -> None
Removes an experimental option by deleting the entry for the given key.
Parameter Description:
- key - Key of the experimental option to remove
### remove_all_extensions
remove_all_extensions() -> None
Removes all plug-ins. Because each plug-in is stored as a whole file, removing a single one is difficult, so to change the set of plug-ins, remove them all and add the ones you need again.
### save
save(path: str = None) -> None
Save the settings to a file.
Parameter Description:
- path - The path of the ini file, which is saved to the module folder by default
Chrome's configuration options are hard to remember, so the commonly used ones are wrapped in simple methods. Calling them modifies the corresponding content of the ini file.
Sets an argument. If the argument takes no value (such as 'zh_CN.utf-8'), pass a bool as value to switch it on or off. Otherwise pass a str as value. When value is "" or False, the argument entry is deleted.
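The switch-or-value rule can be illustrated on a plain list of Chrome arguments. The sketch below is not the library's method: `apply_argument` is a hypothetical helper that returns a new list, and the `name=value` layout for valued arguments is an assumption based on how Chrome command-line switches are usually written. The proxy address is a placeholder.

```python
def apply_argument(arguments, arg, value):
    """Return a new argument list with arg switched on, set to a value, or removed."""
    # Drop any existing entry for this argument, with or without a value.
    kept = [a for a in arguments if a != arg and not a.startswith(arg + '=')]
    if value is True:
        # Valueless argument switched on.
        kept.append(arg)
    elif isinstance(value, str) and value:
        # Argument that carries a value.
        kept.append(f'{arg}={value}')
    # False or "" leaves the argument removed.
    return kept


args = apply_argument([], '--headless', True)
args = apply_argument(args, '--proxy-server', 'http://127.0.0.1:8080')
args = apply_argument(args, '--headless', False)
```

Removing the old entry before adding the new one keeps the list free of duplicates when a value is changed.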