filters
- nomad.filters.completeness(data, periods=1, freq='h', *, start=None, end=None, offset_col=0, relative=False, str_from_time=True, agg_freq=None, traj_cols=None, **kwargs)
Measure trajectory completeness as the fraction of expected time intervals (‘buckets’) containing at least one observation.
- Parameters:
data (pandas.Series or pandas.DataFrame) –
Trajectory data containing timestamps, either as:
A pandas Series of Unix-second integers or datetime64 values.
A DataFrame, from which timestamp and user columns are identified via traj_cols or default column naming conventions.
periods (int, default 1) – Number of units of freq per bucket (must be ≥ 1). For example, periods=3, freq='h' results in 3-hour buckets.
freq ({'s', 'min', 'h', 'd', 'w'}, default 'h') – Time resolution used to define buckets: seconds (‘s’), minutes (‘min’), hours (‘h’), days (‘d’), or weeks (‘w’).
start (scalar, optional) – Explicit lower time bound of the bucket range. If omitted, it is inferred from the data. Ignored if relative=True.
end (scalar, optional) – Explicit upper time bound of the bucket range. If omitted, it is inferred from the data. Ignored if relative=True.
relative (bool, default False) – If False, completeness is measured within a common time span shared by all users. If True, each user’s completeness is computed only within their own individual time span (from their first to their last record).
offset_col (pandas.Series or int, default 0) – Offset in seconds to apply to timestamps (useful for handling time zones). If a tz_offset column is present in the data and indicated via traj_cols or kwargs, this argument is ignored.
agg_freq (str, optional) – Aggregation frequency (e.g., ‘D’ for daily, ‘W’ for weekly, ‘M’ for monthly). If specified, returns completeness aggregated at this frequency instead of overall completeness.
traj_cols (dict, optional) – Mapping from standard keys (‘timestamp’, ‘datetime’, ‘user_id’, ‘tz_offset’) to column names in data. If omitted, defaults are used.
**kwargs – Shorthand overrides for entries in traj_cols.
- Returns:
If input is a single Series and agg_freq=None, returns a single float.
If input is a DataFrame and agg_freq=None, returns a Series indexed by user_id.
If agg_freq is specified, returns completeness aggregated by the specified frequency, either as a Series (single user) or DataFrame (rows per user, columns per aggregation bucket).
- Return type:
float or pandas.Series or pandas.DataFrame
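The bucket-based definition above can be illustrated with a small standard-library sketch. It is not the library's implementation: the helper name completeness_sketch, the FREQ_SECONDS table, and the choice to align buckets to the Unix epoch are assumptions made for illustration only.

```python
# Bucket width in seconds for each documented `freq` alias (illustrative table).
FREQ_SECONDS = {"s": 1, "min": 60, "h": 3600, "d": 86400, "w": 604800}


def completeness_sketch(timestamps, periods=1, freq="h", start=None, end=None):
    """Fraction of buckets between start and end that hold >= 1 timestamp.

    `timestamps` is a list of Unix seconds. Buckets are assumed to be
    aligned to the epoch, which may differ from the library's behaviour.
    """
    if periods < 1:
        raise ValueError("periods must be >= 1")
    width = periods * FREQ_SECONDS[freq]
    # Infer the range from the data when a bound is omitted.
    start = min(timestamps) if start is None else start
    end = max(timestamps) if end is None else end
    expected = end // width - start // width + 1
    hit = {t // width for t in timestamps if start <= t <= end}
    return len(hit) / expected


# Observations in hours 0, 0, 2 and 3: hour 1 is empty.
ts = [0, 600, 7200, 10800]
hourly = completeness_sketch(ts)                      # 3 of 4 buckets -> 0.75
two_hourly = completeness_sketch(ts, periods=2)       # 2 of 2 buckets -> 1.0
padded = completeness_sketch(ts, start=0, end=18000)  # 3 of 6 buckets -> 0.5
```

Widening the buckets (periods=2) hides the one-hour gap, while extending the explicit range with start and end adds empty buckets and lowers the score.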
- nomad.filters.coverage_matrix(data, periods=1, freq='h', start=None, end=None, offset_col=0, relative=False, str_from_time=False, traj_cols=None, **kwargs)
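Return a matrix of 0/1 flags marking which buckets contain at least one observation. Rows correspond to users (or to the single input Series) and columns to bucket start times.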
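A minimal sketch of such a flag matrix, using plain Python containers instead of pandas objects. The function name coverage_sketch, the (user_id, unix_seconds) input format, and the epoch-aligned buckets are assumptions for illustration and do not reflect the library's internals.

```python
def coverage_sketch(pings, width=3600):
    """Build 0/1 coverage flags per user.

    `pings` is an iterable of (user_id, unix_seconds) pairs and `width`
    is the bucket size in seconds. Returns (bucket_starts, matrix), where
    matrix[user] is a list of flags aligned with bucket_starts.
    """
    pings = list(pings)
    times = [t for _, t in pings]
    first, last = min(times) // width, max(times) // width
    bucket_starts = [b * width for b in range(first, last + 1)]

    # Collect the bucket indices each user was observed in.
    seen = {}
    for user, t in pings:
        seen.setdefault(user, set()).add(t // width)

    matrix = {
        user: [int(start // width in hit) for start in bucket_starts]
        for user, hit in seen.items()
    }
    return bucket_starts, matrix


starts, flags = coverage_sketch([("a", 0), ("a", 7200), ("b", 3700)])
# starts -> [0, 3600, 7200]
# flags  -> {"a": [1, 0, 1], "b": [0, 1, 0]}
```

Under these assumptions, the mean of a row equals that user's completeness over the shared time span.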
- nomad.filters.downsample(df, periods=1, freq='min', keep='first', traj_cols=None, verbose=False, **kwargs)
Down-sample df so that each user contributes at most one row in every consecutive periods × freq window.
- Parameters:
df (pandas.DataFrame) – The input data.
periods (int, default 1) – Size of the window expressed in multiples of freq; must be ≥ 1.
freq ({'s', 'min', 'h', 'd', 'w'}, default 'min') – Unit of the window: second, minute, hour, day, or week (lower-case aliases).
keep ({'first', 'last', False}, default 'first') – Which duplicate inside each window to retain, matching pandas.Series.duplicated semantics.
traj_cols (dict, optional) – Mapping from the standard keys ‘timestamp’, ‘datetime’, ‘user_id’, and ‘tz_offset’ to the actual column names in df. Any key may be absent if the corresponding column is not present.
verbose (bool, default False) – When True, prints the fraction of rows removed and the window size.
**kwargs – Shorthand overrides for entries in traj_cols
- Returns:
A view of df containing the surviving rows.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If periods is not a positive integer or freq is invalid.
KeyError – If no suitable time column is found after parsing traj_cols.
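The windowing rule can be sketched with the standard library as follows. The name downsample_sketch and the tuple-based rows are hypothetical; the handling of keep=False (dropping every row of a window that holds more than one row) is an assumption based on how pandas treats keep=False in duplicated, not a confirmed detail of this function.

```python
from collections import defaultdict


def downsample_sketch(rows, width=60, keep="first"):
    """Keep at most one row per (user, window).

    `rows` is a list of tuples whose first two fields are
    (user_id, unix_seconds); `width` is the window size in seconds.
    Surviving rows are returned in their original order.
    """
    # Group row positions by user and epoch-aligned window index.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        user, t = row[0], row[1]
        groups[(user, t // width)].append(i)

    survivors = set()
    for positions in groups.values():
        if keep == "first":
            survivors.add(positions[0])
        elif keep == "last":
            survivors.add(positions[-1])
        elif keep is False:
            # Assumed semantics: windows with duplicates keep nothing.
            if len(positions) == 1:
                survivors.add(positions[0])
        else:
            raise ValueError("keep must be 'first', 'last' or False")
    return [row for i, row in enumerate(rows) if i in survivors]


rows = [("a", 0), ("a", 30), ("a", 65), ("b", 10)]
kept_first = downsample_sketch(rows)               # [("a", 0), ("a", 65), ("b", 10)]
kept_last = downsample_sketch(rows, keep="last")   # [("a", 30), ("a", 65), ("b", 10)]
kept_none = downsample_sketch(rows, keep=False)    # [("a", 65), ("b", 10)]
```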
- nomad.filters.is_within(df, within, poly_crs=None, data_crs=None, traj_cols=None, **kwargs)
Identify which points of a DataFrame fall within the given polygon. The result is a Boolean mask that can be used to filter the DataFrame.
- Parameters:
df (pd.DataFrame or GeoDataFrame) – Trajectory data.
within (shapely Polygon/MultiPolygon or WKT string) – Polygon defining the spatial filter.
traj_cols (dict, optional) – Mapping of logical trajectory column names to actual columns.
poly_crs (CRS or str, optional) – CRS of the polygon.
data_crs (CRS or str, optional) – CRS of the DataFrame coordinates.
**kwargs – Additional parameters for trajectory columns resolution.
- Returns:
Boolean mask indicating which points fall inside the polygon given by within.
- Return type:
pd.Series
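To make the idea of a point-in-polygon mask concrete, here is a standard-library sketch using the classic ray-casting test. It is a swapped-in technique for illustration: the documented function accepts shapely geometries or WKT and may handle CRS conversion and boundary points differently. The names point_in_polygon and ring are hypothetical, and the sketch assumes the points and the polygon already share one coordinate system.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test against a simple polygon.

    `ring` is a list of (x, y) vertices. A horizontal ray is cast from
    the point to the right; an odd number of edge crossings means the
    point is inside. Points exactly on the boundary are not treated
    specially in this sketch.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the ray's height can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
points = [(1, 1), (3, 1), (0.5, 1.5)]
mask = [point_in_polygon(px, py, square) for px, py in points]
# mask -> [True, False, True]
```

A mask of this shape selects the matching rows when used as a Boolean index.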
- nomad.filters.q_filter(df: DataFrame, qbar: float, traj_cols: dict = None, user_id: str = 'user_id', timestamp: str = 'timestamp')
Compute the q statistic for each user, defined as the proportion of unique hours with pings over the total observed hours (last hour - first hour), and keep only users where q > qbar.
- Parameters:
df (pd.DataFrame) – Input DataFrame with user_id and timestamp columns.
qbar (float) – The threshold q value; users with q > qbar will be retained.
traj_cols (dict, optional) – Dictionary containing column mappings, e.g., {“user_id”: “user_id”, “timestamp”: “timestamp”}.
user_id (str, optional) – Name of the user_id column (default is “user_id”).
timestamp (str, optional) – Name of the timestamp column (default is “timestamp”).
- Returns:
A Series containing the user IDs for users whose q_stat > qbar.
- Return type:
pd.Series
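The q statistic can be sketched in plain Python as shown here. One detail is an assumption: the denominator counts hour buckets inclusively (last hour - first hour + 1) so that q stays within [0, 1], whereas the description above states "last hour - first hour", and the library may compute it differently. The helper names q_stats and q_filter_sketch are hypothetical.

```python
def q_stats(pings):
    """Per-user q: unique hours with pings / hours spanned (inclusive).

    `pings` is an iterable of (user_id, unix_seconds) pairs.
    """
    hours = {}
    for user, t in pings:
        hours.setdefault(user, set()).add(t // 3600)
    # Assumption: inclusive hour count in the denominator.
    return {u: len(h) / (max(h) - min(h) + 1) for u, h in hours.items()}


def q_filter_sketch(pings, qbar):
    """Return the user IDs whose q exceeds qbar."""
    return [u for u, q in q_stats(pings).items() if q > qbar]


# User "a" pings in four consecutive hours; user "b" only in hours 0 and 9.
pings = [("a", h * 3600) for h in range(4)] + [("b", 0), ("b", 9 * 3600)]
stats = q_stats(pings)               # {"a": 1.0, "b": 0.2}
dense = q_filter_sketch(pings, 0.5)  # ["a"]
```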
- nomad.filters.to_projection(data, crs_to, data_crs=None, traj_cols=None, **kwargs)
Project coordinates from data_crs to crs_to, with robust column handling.
Warns if coordinate columns and CRS type appear mismatched.
- Parameters:
data (pd.DataFrame) – Data to project.
crs_to (str or CRS) – Output CRS (required).
data_crs (str or CRS, optional) – Source CRS (default: inferred).
traj_cols (dict, optional) – Mapping of logical column names to actual columns.
**kwargs – Passed to trajectory column parsing.
- Returns:
Projected x and y as Series, aligned to data.index
- Return type:
pd.Series, pd.Series
Note
To assign directly, use np.column_stack. For example: df[['lon', 'lat']] = np.column_stack(to_projection(…))
- nomad.filters.to_tessellation(data, index, res, data_crs=None, traj_cols=None, **kwargs)
Assign each point to a cell of a spatial tessellation (grid index), with robust column handling.
- Parameters:
data (pd.DataFrame) – Data containing the coordinates to index.
index (str) – One of ‘h3’, ‘geohash’, or ‘s2’.
data_crs (str or CRS, optional) – Source CRS (default: inferred).
traj_cols (dict, optional) – Mapping of logical column names to actual columns.
**kwargs – Passed to trajectory column parsing.
- nomad.filters.to_timestamp(datetime, tz_offset=None)
Convert a datetime Series or scalar into UNIX timestamps (seconds).
- Parameters:
datetime (pd.Series, str, pd.Timestamp, or scalar)
tz_offset (pd.Series, optional)
- Returns:
UNIX timestamps as nullable Int64 values (seconds since epoch) for non-scalar inputs. Returns scalar int if input was scalar.
- Return type:
pd.Series or int
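A scalar version of this conversion can be sketched with the standard library. The treatment of tz_offset is an assumption, since the parameter is not described above: the sketch reads a naive datetime as local wall-clock time and takes tz_offset as seconds from UTC to local time (the convention documented for to_yyyymmdd). The name to_timestamp_sketch is hypothetical.

```python
from datetime import datetime, timezone


def to_timestamp_sketch(value, tz_offset=None):
    """Convert an ISO-8601 string or datetime to Unix seconds.

    Timezone-aware inputs are converted directly. Naive inputs are
    assumed to be local wall-clock time; `tz_offset` (seconds from UTC
    to local time) is subtracted to recover UTC. With no offset, naive
    inputs are read as UTC.
    """
    dt = datetime.fromisoformat(value) if isinstance(value, str) else value
    if dt.tzinfo is not None:
        return int(dt.timestamp())
    as_utc = int(dt.replace(tzinfo=timezone.utc).timestamp())
    return as_utc - (tz_offset or 0)


utc_midnight = to_timestamp_sketch("2024-01-01 00:00:00")
# utc_midnight -> 1704067200
local_midnight = to_timestamp_sketch("2024-01-01 00:00:00", tz_offset=-18000)
# local_midnight -> 1704085200 (midnight in UTC-5 is 05:00 UTC)
```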
- nomad.filters.to_yyyymmdd(time_values, tz_offset=None)
Convert datetimes/timestamps to integer YYYYMMDD.
Accepts heterogeneous inputs and optional per-row timezone offsets. If tz_offset is provided (seconds), the date is computed in that local time; otherwise dates are computed in UTC.
- Parameters:
time_values (pd.Series) – Series of datetime64, strings, pandas.Timestamp objects, or Unix seconds.
tz_offset (pd.Series or scalar, optional) – Seconds offset from UTC to local time (e.g., -18000 for UTC-5). If provided, the conversion uses local dates; otherwise UTC dates.
- Returns:
Integer dates encoded as YYYYMMDD (dtype Int64, NA-friendly).
- Return type:
pd.Series
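The effect of tz_offset on the resulting date can be seen in a scalar standard-library sketch. The name to_yyyymmdd_sketch is hypothetical and the sketch handles Unix seconds only, not the heterogeneous inputs the documented function accepts.

```python
from datetime import datetime, timezone


def to_yyyymmdd_sketch(unix_seconds, tz_offset=0):
    """Encode the date of a Unix timestamp as an integer YYYYMMDD.

    `tz_offset` is the offset in seconds from UTC to local time; adding
    it before reading the calendar date yields the local date.
    """
    shifted = datetime.fromtimestamp(unix_seconds + tz_offset, tz=timezone.utc)
    return shifted.year * 10000 + shifted.month * 100 + shifted.day


ts = 1704074400  # 2024-01-01 02:00:00 UTC
utc_date = to_yyyymmdd_sketch(ts)                      # 20240101
local_date = to_yyyymmdd_sketch(ts, tz_offset=-18000)  # 20231231 (21:00 in UTC-5)
```

The same instant falls on different calendar days depending on the offset, which is why per-row offsets matter when grouping pings by local date.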