Loading Trajectory Data
Mobility data comes in many formats: timestamps as unix integers or ISO strings (with timezones), coordinates in lat/lon or projected, files as single CSVs or partitioned directories.
nomad.io.from_file handles these cases with a single function call.
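To make the timestamp variety concrete, here is a minimal sketch using only the Python standard library. The two values are illustrative and not taken from the nomad sample data; they show that a unix integer and an ISO string with a UTC offset can describe the same instant.

```python
from datetime import datetime, timezone

# Illustrative values (not from the nomad sample data):
# one instant written as a unix integer and as an ISO 8601 string.
unix_seconds = 1704119340
iso_string = "2024-01-01T09:29:00-05:00"

from_unix = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
from_iso = datetime.fromisoformat(iso_string)

# Timezone-aware datetimes compare by instant, not by wall-clock text.
print(from_unix.isoformat())  # 2024-01-01T14:29:00+00:00
print(from_unix == from_iso)  # True
```

A loader has to know which of these representations a column uses before it can cast it, which is why the column mapping matters in the cells that follow.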
[1]:
import glob
import pandas as pd
import nomad.io.base as loader
import nomad.data as data_folder
from pathlib import Path
data_dir = Path(data_folder.__file__).parent
Pandas vs nomad.io for partitioned data
Partitioned directories (e.g., date=2024-01-01/, date=2024-01-02/, …) require a loop with pandas:
[2]:
csv_files = glob.glob(str(data_dir / "partitioned_csv" / "*" / "*.csv"))
df_list = []
for f in csv_files:
    df_list.append(pd.read_csv(f))
df_pandas = pd.concat(df_list, ignore_index=True)
print(f"Pandas: {len(df_pandas)} rows")
print(df_pandas.dtypes)
print("\nFirst few rows:")
print(df_pandas.head(3))
Pandas: 25835 rows
user_id object
dev_lat float64
dev_lon float64
local_datetime object
dtype: object
First few rows:
user_id dev_lat dev_lon local_datetime
0 wizardly_joliot 38.321711 -36.667334 2024-01-01 14:29:00.000000000
1 wizardly_joliot 38.321676 -36.667365 2024-01-01 14:35:00.000000000
2 wonderful_swirles 38.321017 -36.667869 2024-01-01 15:06:00.000000000
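In the output above, local_datetime is still a column of strings. With plain pandas the cast is a separate manual step. A minimal sketch on a small in-memory frame that copies two of the rows printed above (df_small is a name made up for this example):

```python
import pandas as pd

# Two rows copied from the output above; df_small is a hypothetical name.
df_small = pd.DataFrame({
    "user_id": ["wizardly_joliot", "wizardly_joliot"],
    "dev_lat": [38.321711, 38.321676],
    "dev_lon": [-36.667334, -36.667365],
    "local_datetime": ["2024-01-01 14:29:00.000000000",
                       "2024-01-01 14:35:00.000000000"],
})

# Manual cast: strings -> timezone-naive datetimes.
df_small["local_datetime"] = pd.to_datetime(df_small["local_datetime"])

# Datetime arithmetic only works after the cast.
gap = df_small["local_datetime"].diff().iloc[1]
print(gap)
```

The cast has to be repeated for every datetime column and every dataset, which is the bookkeeping a column mapping is meant to remove.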
nomad.io.from_file handles partitioned directories in one line, plus automatic type casting and column mapping:
[3]:
traj_cols = {"user_id": "user_id",
             "latitude": "dev_lat",
             "longitude": "dev_lon",
             "datetime": "local_datetime"}
df = loader.from_file(data_dir / "partitioned_csv", format="csv", traj_cols=traj_cols, parse_dates=True)
print(f"nomad.io: {len(df)} rows")
print(df.dtypes)
print("\nFirst few rows:")
print(df.head(3))
print("\nNote: 'local_datetime' is now datetime64[ns], not object!")
nomad.io: 25835 rows
user_id object
dev_lat float64
dev_lon float64
local_datetime datetime64[ns]
dtype: object
First few rows:
user_id dev_lat dev_lon local_datetime
0 admiring_curie 38.320444 -36.666827 2024-01-04 02:40:00
1 admiring_curie 38.320438 -36.666755 2024-01-04 03:16:00
2 admiring_curie 38.320434 -36.666877 2024-01-04 19:21:00
Note: 'local_datetime' is now datetime64[ns], not object!
C:\Users\pacob\Desktop\Brain\Code Development\nomad\nomad\io\base.py:621: UserWarning: The 'local_datetime' column has timezone-naive records consider localizing or using unix timestamps.
warnings.warn(f"The '{col}' column has timezone-naive records consider localizing or using unix timestamps.")
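The warning suggests localizing naive datetimes or using unix timestamps. One possible way to do both with plain pandas is sketched below. The fixed UTC-5 offset is an assumption chosen only for illustration; the real offset of this dataset is not stated here, and this is not necessarily how nomad handles it internally.

```python
from datetime import timedelta, timezone
import pandas as pd

# Naive timestamps like the ones printed above.
naive = pd.Series(pd.to_datetime(["2024-01-04 02:40:00", "2024-01-04 03:16:00"]))

# ASSUMPTION: a fixed UTC-5 offset, used only for illustration.
assumed_tz = timezone(timedelta(hours=-5))

localized = naive.dt.tz_localize(assumed_tz)  # attach the offset, wall clock unchanged
as_utc = localized.dt.tz_convert("UTC")       # same instants expressed in UTC

# Unix seconds: whole seconds elapsed since the epoch.
epoch = pd.Timestamp("1970-01-01", tz="UTC")
unix_seconds = (as_utc - epoch) // pd.Timedelta(seconds=1)
print(localized.iloc[0])
print(unix_seconds.tolist())
```

Localizing keeps the wall-clock reading and records which offset it belongs to, while unix seconds drop the wall clock and keep only the instant.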
The same pattern works for Parquet files. Type casting and processing rely on traj_cols, which tells the function which columns in the file correspond to the default canonical spatio-temporal column names:
[4]:
traj_cols = {"user_id": "uid", "timestamp": "timestamp",
             "latitude": "latitude", "longitude": "longitude", "date": "date"}
df = loader.from_file(data_dir / "partitioned_parquet", format="parquet", traj_cols=traj_cols, parse_dates=True)
print(f"Loaded {len(df)} rows")
print(df.dtypes)
Loaded 25835 rows
uid object
timestamp Int64
latitude float64
longitude float64
date object
dtype: object
[5]:
# These are the default canonical column names
from nomad.constants import DEFAULT_SCHEMA
print(DEFAULT_SCHEMA.keys())
dict_keys(['user_id', 'latitude', 'longitude', 'datetime', 'start_datetime', 'end_datetime', 'start_timestamp', 'end_timestamp', 'timestamp', 'date', 'utc_date', 'x', 'y', 'geohash', 'tz_offset', 'duration', 'ha', 'h3_cell', 'location_id'])
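Because the keys of traj_cols must be canonical names, a misspelled key is an easy mistake to make. The sketch below is a hypothetical pre-check, not part of nomad: CANONICAL_KEYS is copied by hand from the output above, and unknown_keys is a helper invented for this example. Whether nomad performs a similar check itself is not shown here.

```python
# Copied by hand from the DEFAULT_SCHEMA keys printed above.
CANONICAL_KEYS = {
    "user_id", "latitude", "longitude", "datetime", "start_datetime",
    "end_datetime", "start_timestamp", "end_timestamp", "timestamp",
    "date", "utc_date", "x", "y", "geohash", "tz_offset", "duration",
    "ha", "h3_cell", "location_id",
}

def unknown_keys(traj_cols):
    """Hypothetical helper: return mapping keys that are not canonical names."""
    return sorted(set(traj_cols) - CANONICAL_KEYS)

# Mapping direction follows the cells above: canonical name -> column in the file.
good = {"user_id": "user_id", "latitude": "dev_lat",
        "longitude": "dev_lon", "datetime": "local_datetime"}
bad = {"lat": "dev_lat", "longitude": "dev_lon"}

print(unknown_keys(good))  # []
print(unknown_keys(bad))   # ['lat']
```

In a real session the hard-coded set could be replaced with set(DEFAULT_SCHEMA.keys()) so the check follows the installed version.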