As Python developers, we have to learn many third-party modules, and reading the documentation is how we get familiar with what they offer. But how can we quickly tell whether a given module has any functions related to what we are working on? This article shows a small trick for doing exactly that.
First of all, the built-in function dir() lists the attribute names of a given object, and that object can be anything in Python.
For example, to see which attributes (we are mainly interested in functions) str supports:
In [25]: list(filter(lambda i: not i.startswith('_'), dir(str)))
Out[25]:
['capitalize',
'casefold',
'center',
'count',
'encode',
'endswith',
'expandtabs',
'find',
'format',
'format_map',
'index',
'isalnum',
'isalpha',
'isascii',
'isdecimal',
'isdigit',
'isidentifier',
'islower',
'isnumeric',
'isprintable',
'isspace',
'istitle',
'isupper',
'join',
'ljust',
'lower',
'lstrip',
'maketrans',
'partition',
'replace',
'rfind',
'rindex',
'rjust',
'rpartition',
'rsplit',
'rstrip',
'split',
'splitlines',
'startswith',
'strip',
'swapcase',
'title',
'translate',
'upper',
'zfill']
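Since dir() returns every attribute name and not only functions, a further filter with the built-in callable() may be useful. The following is a minimal sketch; the helper name public_callables and the class Sample are made up for illustration:

```python
def public_callables(obj):
    """Return the names of the public attributes of obj that are callable."""
    return [
        name
        for name in dir(obj)
        # getattr() fetches the attribute behind the name so callable() can test it
        if not name.startswith('_') and callable(getattr(obj, name, None))
    ]


class Sample:
    version = 1          # plain data attribute, filtered out

    def run(self):       # method, kept
        return 'running'


print(public_callables(Sample))          # ['run']
print('split' in public_callables(str))  # True
```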
Secondly, every Python object has a special attribute, __doc__, that holds the docstring of the object (or None if the object has no docstring).
For example, to show the docstring of the str.split function:
In [29]: str.split.__doc__
Out[29]: 'Return a list of the words in the string, using sep as the delimiter string.\n\n sep\n The delimiter according which to split the string.\n None (the default value) means split according to any whitespace,\n and discard empty strings from the result.\n maxsplit\n Maximum number of splits to do.\n -1 (the default value) means no limit.'
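As you can see, the interactive output shows the repr of the string: everything is on one line, and each newline appears as a literal \n escape sequence instead of a real line break.
To render the newlines, pass the docstring to the built-in function print; the parameter end="" merely suppresses the extra newline that print would otherwise append: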
In [30]: print(str.split.__doc__, end="")
Return a list of the words in the string, using sep as the delimiter string.
sep
The delimiter according which to split the string.
None (the default value) means split according to any whitespace,
and discard empty strings from the result.
maxsplit
Maximum number of splits to do.
-1 (the default value) means no limit.
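As an alternative, the standard library function inspect.getdoc() returns the docstring with the leading indentation of its continuation lines cleaned up, which can make the printed output easier to read. The function greet below is a made-up example, not part of any library:

```python
import inspect


def greet(name):
    """Return a greeting.

    name
        The person to greet.
    """
    return f'Hello, {name}!'


# The raw attribute: depending on the Python version, it may still carry the
# indentation of the source code and a trailing newline.
print(repr(greet.__doc__))

# inspect.getdoc() removes the common indentation and trailing blank lines.
print(inspect.getdoc(greet))
```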
Thirdly, the built-in function getattr returns an attribute of an object, given the name of the attribute as a string:
In [33]: print(getattr(str, 'split').__doc__, end="")
Return a list of the words in the string, using sep as the delimiter string.
sep
The delimiter according which to split the string.
None (the default value) means split according to any whitespace,
and discard empty strings from the result.
maxsplit
Maximum number of splits to do.
-1 (the default value) means no limit.
Why do we need getattr? We want to go through all functions of the given module and apply a regex filter to find the ones related to the functionality we are interested in. The output of dir is only a list of str, each being an attribute name. We cannot get the right __doc__ from the attribute name; we need the attribute itself, and getattr gives it to us.
For example, suppose we want to know which functions of str are related to substrings (pattern 'substr'):
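To illustrate the point, the sketch below turns the names returned by dir() into a dictionary of docstrings with getattr. The variable names name, attr and docs are chosen here for illustration only:

```python
# dir() yields plain strings, so asking the name for __doc__ only gives
# the docstring of the str type itself, not that of the split method.
name = 'split'
print(name.__doc__ == str.__doc__)   # True

# getattr() turns the name back into the attribute it refers to.
attr = getattr(str, name)
print(attr is str.split)             # True

# Build a name -> docstring mapping for all public attributes of str.
docs = {
    n: getattr(str, n).__doc__ or ''   # fall back to '' if there is no docstring
    for n in dir(str)
    if not n.startswith('_')
}
print('split' in docs)               # True
```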
In [39]: import re
    ...: for a in filter(lambda i: not i.startswith('_'), dir(str)):
    ...:     doc = getattr(str, a).__doc__ or ''  # __doc__ may be None
    ...:     # re.search scans the whole name and the whole docstring
    ...:     if re.search('substr', a) or re.search('substr', doc):
    ...:         print(f'*** {a} ***')
    ...:         print(doc, end="")
    ...:         print('\n')
*** count ***
S.count(sub[, start[, end]]) -> int
Return the number of non-overlapping occurrences of substring sub in
string S[start:end]. Optional arguments start and end are
interpreted as in slice notation.
*** find ***
S.find(sub[, start[, end]]) -> int
Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end]. Optional
arguments start and end are interpreted as in slice notation.
Return -1 on failure.
*** index ***
S.index(sub[, start[, end]]) -> int
Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end]. Optional
arguments start and end are interpreted as in slice notation.
Raises ValueError when the substring is not found.
*** replace ***
Return a copy with all occurrences of substring old replaced by new.
count
Maximum number of occurrences to replace.
-1 (the default value) means replace all occurrences.
If the optional argument count is given, only the first count occurrences are
replaced.
*** rfind ***
S.rfind(sub[, start[, end]]) -> int
Return the highest index in S where substring sub is found,
such that sub is contained within S[start:end]. Optional
arguments start and end are interpreted as in slice notation.
Return -1 on failure.
*** rindex ***
S.rindex(sub[, start[, end]]) -> int
Return the highest index in S where substring sub is found,
such that sub is contained within S[start:end]. Optional
arguments start and end are interpreted as in slice notation.
Raises ValueError when the substring is not found.
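One detail about the regex calls is worth a short illustration: re.match only matches at the beginning of a string, whereas re.search scans the whole string. The sample text below is shortened from the count docstring shown above:

```python
import re

doc = ('S.count(sub[, start[, end]]) -> int\n\n'
       'Return the number of non-overlapping occurrences of substring sub')

# re.match only succeeds if the pattern matches at position 0.
print(re.match('substr', doc))                        # None

# Wrapping the pattern in .* is not enough on its own, because '.' does
# not cross the newlines of a multi-line docstring by default.
print(re.match('.*substr.*', doc))                    # None

# With re.DOTALL, '.' also matches newlines, so the match succeeds.
print(bool(re.match('.*substr.*', doc, re.DOTALL)))   # True

# re.search scans the whole string and needs neither workaround.
print(bool(re.search('substr', doc)))                 # True
```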
Finally, to make the code more generic, the same logic can be wrapped in a function, show_functions_docs, that filters the function names and docstrings of any object by a regex.
Example: show the functions in the module pyspark.sql.functions that are related to map size:
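A minimal sketch of what such a helper might look like is given below. It is not necessarily the exact code behind the output that follows: the parameter name obj, both default values, the separator width and the returned list are assumptions made for this sketch.

```python
import re


def show_functions_docs(obj=str, filter_regex='.*'):
    """Print the name and docstring of each matching public attribute of obj.

    NOTE: the parameter name `obj`, the defaults and the return value are
    assumptions of this sketch.
    """
    # re.DOTALL lets patterns such as 'map.*size' span several docstring lines.
    pattern = re.compile(filter_regex, re.DOTALL)
    separator = '/' * 59
    matched = []
    for name in dir(obj):
        if name.startswith('_'):
            continue                                # skip private and special names
        doc = getattr(obj, name).__doc__ or ''      # __doc__ may be None
        if pattern.search(name) or pattern.search(doc):
            print(separator)
            print(name)
            print(separator)
            print(doc, end='')
            print('\n')
            matched.append(name)
    return matched                                  # names that matched, for reuse


found = show_functions_docs(str, filter_regex='substr')
```

With pyspark installed, passing pyspark.sql.functions as obj should search that module in the same way.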
show_functions_docs(filter_regex = 'map.*size')
///////////////////////////////////////////////////////////
size
///////////////////////////////////////////////////////////
Collection function: returns the length of the array or map stored in the column.
.. versionadded:: 1.5.0
Parameters
----------
col : :class:`~pyspark.sql.Column` or str
name of column or expression
Examples
--------
>>> df = spark.createDataFrame([([1, 2, 3],),([1],),([],)], ['data'])
>>> df.select(size(df.data)).collect()
[Row(size(data)=3), Row(size(data)=1), Row(size(data)=0)]
That’s all there is to this trick for quickly finding the functions we are interested in.
I hope you find it helpful!
Thanks for reading! Happy coding!