I was working on a longer post recently about optimizing a messy exploratory Python project, where the file I was focusing on took about 16 seconds to run. While examining the profiling output, I saw that the program was spending about 2.5 seconds in the datetime.strptime() function. That seemed like a long time, but there were more pressing issues, like plots opening in a new browser window every time the program ran. After addressing those more immediate optimizations, I came back to the time spent in strptime() and started wondering: is there anything faster than strptime()?
Most of my programming work has focused on smaller data sets, with tens of data points up to several thousand data points. The programs I’ve written have had significant impact, but they were run once a week or once a month; I haven’t needed to spend much time on optimization over the years. However, one of my current projects works with a dataset of about 150,000 data points, and this number will only grow. I know this is far from big data, but it’s a large enough dataset to start to see impacts from using non-optimal data structures and tools.
I had originally thought of strptime() as a fairly simple standard library function that I couldn’t do much to optimize. But then I realized there are lots of ways to turn a string into a datetime object! The strptime() function is convenient to use, but it’s not necessarily efficient. I was curious to see how it compares to some other approaches to turning strings into datetime objects.
Note: The code for this article is available in the faster_than_strptime GitHub repository.
The project I’m currently working on has strings formatted like this:
"2016-02-09 15:45"
Let’s write a program that will generate as many strings like this as we want, and write them to a data file. I’m calling this generate_data.py:
"""Generate the specified number of timestamp strings."""
import sys
from random import choice
try:
num_data_points = int(sys.argv[1])
except IndexError:
num_data_points = 100_000
# Generate strings in this format: "2016-02-09 15:45"
# Keep everything two digits except year.
years, months, days = range(1900, 2020), range(10, 13), range(10, 29)
hours, seconds = range(10, 24), range(10, 60)
def get_ts_string():
year, month, day = choice(years), choice(months), choice(days)
hour, second = choice(hours), choice(seconds)
return f"{year}-{month}-{day} {hour}:{second}\n"
print("Building strings...")
ts_strings = [get_ts_string() for _ in range(num_data_points)]
print("Writing strings to file...")
with open('data_file.txt', 'w') as f:
f.writelines(ts_strings)
print(f" Wrote {len(ts_strings)} timestamp strings to file.")
This accepts a single command-line argument for the number of strings to generate, defaulting to 100,000 if no argument is given. It then builds that number of timestamps with the same format shown earlier, and saves them to data_file.txt.
Let’s generate a data file with 10 timestamps to start with:
$ python generate_data.py 10
Building strings...
Writing strings to file...
Wrote 10 timestamp strings to file.
Now let’s look at data_file.txt to make sure the timestamps look right:
$ cat data_file.txt
1981-12-13 19:23
2001-10-21 16:18
1975-10-19 17:33
...
This works, so let’s move on.
There are a number of approaches to processing strings containing datetime information. We’ll start with strptime() as a baseline, and then try some alternatives.
First, let’s process these timestamps with strptime(). Here’s process_data.py:
"""Read timestamp strings from a file, and convert them to datetime objects.
Timestamp format: "2016-02-09 15:45"
"""
from datetime import datetime
print("Reading data from file...")
with open('data_file.txt') as f:
lines = f.readlines()
lines = [line.rstrip() for line in lines]
print(f" Found {len(lines)} timestamp strings.")
print("\nProcessing timestamps...")
# Using strptime:
timestamps = [datetime.strptime(line, "%Y-%m-%d %H:%M") for line in lines]
print("\nVerify conversion:")
for line, ts in zip(lines[:3], timestamps[:3]):
print(f" {line} -> {ts.isoformat()}")
This file reads in the data from data_file.txt, stores the timestamp strings in lines, and then strips the newline character from the end of each string. It then processes all of the timestamps using strptime(), and prints the first 3 strings and datetime objects so we can confirm the conversion was done correctly.
Here’s the output:
$ python process_data.py
Reading data from file...
Found 10 timestamp strings.
Processing timestamps...
Verify conversion:
1981-12-13 19:23 -> 1981-12-13T19:23:00
2001-10-21 16:18 -> 2001-10-21T16:18:00
1975-10-19 17:33 -> 1975-10-19T17:33:00
The conversion is being done correctly, so let’s try it with a larger data set:
$ python generate_data.py 500_000
Building strings...
Writing strings to file...
Wrote 500000 timestamp strings to file.
I’m looking for something that takes about 10 seconds to run under the profiler. On my system, that works out to about 500,000 data points. When I use cProfile, I like to dump the output into a text file, so it’s easier to scroll through and doesn’t clutter my terminal output.
$ python -m cProfile -s cumtime process_data.py > profile_output.txt
Here’s the first lines of the profile output:
12006498 function calls (12006390 primitive calls) in 6.485 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
5/1 0.000 0.000 6.485 6.485 {built-in method builtins.exec}
1 0.010 0.010 6.485 6.485 process_data.py:1(<module>)
1 0.125 0.125 6.289 6.289 process_data.py:20(<listcomp>)
500000 0.256 0.000 6.163 0.000 {built-in method strptime}
With profiling, the file took 6.49 seconds to run. We can see that 6.16 seconds is spent in strptime(). This is one run using cProfile, so it’s not a strict evaluation of the performance of strptime(), but it should be fine for comparison to other approaches to processing timestamps like this, at least during an initial investigation. It will also help us understand why some approaches are slower than others.
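As an aside: if you’d rather explore profiling results interactively than scroll through a text dump, cProfile can also write a binary stats file that the standard library’s pstats module can load. This is just a minimal sketch of that workflow; nothing later in this article depends on it.
# Save binary profiling data instead of redirecting text output:
#     python -m cProfile -o profile.stats process_data.py
import pstats

stats = pstats.Stats('profile.stats')
# Show the ten entries with the largest cumulative time.
stats.sort_stats('cumulative').print_stats(10)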
The next thing I thought of was to manually parse each string and build a datetime object from those parts. I don’t particularly expect this approach to be very efficient, because we’re doing a lot of string manipulation, and it seems like strptime() is probably doing similar work more efficiently. But let’s try it for comparison.
Here’s the relevant part of process_data.py:
...
from datetime import datetime

def get_ts_string_parser(line):
    """Parse string manually."""
    year, month, day = int(line[:4]), int(line[5:7]), int(line[8:10])
    hour, minute = int(line[11:13]), int(line[14:])
    return datetime(year=year, month=month, day=day, hour=hour, minute=minute)

print("Reading data from file...")
...

print("\nProcessing timestamps...")

# Using strptime:
# timestamps = [datetime.strptime(line, "%Y-%m-%d %H:%M") for line in lines]

# Using manual string parsing:
timestamps = [get_ts_string_parser(line) for line in lines]

print("\nVerify conversion:")
for line, ts in zip(lines[:3], timestamps[:3]):
    print(f" {line} -> {ts.isoformat()}")
We add a function to parse a single timestamp string. We comment out the strptime() code, and build the timestamps list from the new string-parsing function. The results are surprising:
1003491 function calls (1003479 primitive calls) in 1.260 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
3/1 0.000 0.000 1.260 1.260 {built-in method builtins.exec}
1 0.010 0.010 1.260 1.260 process_data.py:1(<module>)
1 0.100 0.100 1.064 1.064 process_data.py:32(<listcomp>)
500000 0.964 0.000 0.964 0.000 process_data.py:9(get_ts_simple_string_parser)
This takes only 1.26 seconds! That’s already much faster than using strptime()!
If string parsing is faster than strptime(), maybe regexes will be even faster? I’ve always thought of them as being efficient for processing data, so let’s see how a regex-based approach compares to the previous two approaches.
...
import re
from datetime import datetime

def get_ts_string_parser(line):
    ...

def get_ts_regex(line, ts_pattern):
    """Parse string using a regex."""
    m = ts_pattern.match(line)
    year, month, day = int(m.group(1)), int(m.group(2)), int(m.group(3))
    hour, minute = int(m.group(4)), int(m.group(5))
    return datetime(year=year, month=month, day=day, hour=hour, minute=minute)

print("Reading data from file...")
...

# Using manual string parsing:
# timestamps = [get_ts_string_parser(line) for line in lines]

# Using regex:
ts_pattern = re.compile(r'([\d]{4})-([\d]{2})-([\d]{2}) ([\d]{2}):([\d]{2})')
timestamps = [get_ts_regex(line, ts_pattern) for line in lines]

print("\nVerify conversion:")
for line, ts in zip(lines[:3], timestamps[:3]):
    print(f" {line} -> {ts.isoformat()}")
We import re, and write a function to parse timestamps using a regular expression. We make sure to comment out the string-parsing approach, and build the list of datetime objects from the regex function.
Here’s the profile output:
4004202 function calls (4004148 primitive calls) in 2.086 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
3/1 0.000 0.000 2.086 2.086 {built-in method builtins.exec}
1 0.011 0.011 2.086 2.086 process_data.py:1(<module>)
1 0.112 0.112 1.886 1.886 process_data.py:46(<listcomp>)
500000 1.155 0.000 1.774 0.000 process_data.py:18(get_ts_regex)
2500000 0.348 0.000 0.348 0.000 {method 'group' of 're.Match' objects}
500000 0.271 0.000 0.271 0.000 {method 'match' of 're.Pattern' objects}
At 2.09 seconds, this approach takes longer than simple string parsing, and that’s pretty clearly due to the overhead of the matching and grouping work. Notice there are 2.5 million calls to the group() method! That’s one call for each of the five parts of the timestamp, for all 500,000 timestamps.
Note that it’s important to compile the regular expression pattern outside of the function that’s called for each timestamp. If you compile the pattern inside the function (or inside a loop), you’ll have another 500,000 function calls in the profiling output.
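To make that concrete, here’s a rough sketch of the two placements; only the timestamp pattern from this article is assumed, and the helper names are just for illustration:
import re

# Compiled once, reused for every line -- what the code above does:
ts_pattern = re.compile(r'([\d]{4})-([\d]{2})-([\d]{2}) ([\d]{2}):([\d]{2})')

def parse_compiled_once(line):
    return ts_pattern.match(line)

# Compiled inside the function that runs for every line -- this adds a
# re.compile() call (at minimum, a cache lookup) to each of the 500,000
# conversions, and it shows up in the profiling output:
def parse_compiled_every_call(line):
    pattern = re.compile(r'([\d]{4})-([\d]{2})-([\d]{2}) ([\d]{2}):([\d]{2})')
    return pattern.match(line)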
I wrote the rest of this article, and in reading it over before posting I realized there was probably another improvement to make to this approach. We’re making five separate calls to group() in get_ts_regex(). But you can call group() with as many arguments as you want, and then unpack all of the return values at once. Here’s what that looks like for this function:
def get_ts_regex(line, ts_pattern):
    """Parse string using a regex."""
    m = ts_pattern.match(line)
    year, month, day, hour, minute = m.group(1, 2, 3, 4, 5)
    year, month, day, hour, minute = (int(year), int(month), int(day),
                                      int(hour), int(minute))
    return datetime(year=year, month=month, day=day, hour=hour, minute=minute)
Now the performance of the regex approach is much closer to the string-based approach:
2004202 function calls (2004148 primitive calls) in 1.660 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
3/1 0.000 0.000 1.660 1.660 {built-in method builtins.exec}
1 0.010 0.010 1.660 1.660 process_data.py:1(<module>)
1 0.109 0.109 1.464 1.464 process_data.py:47(<listcomp>)
500000 0.930 0.000 1.355 0.000 process_data.py:18(get_ts_regex)
500000 0.259 0.000 0.259 0.000 {method 'match' of 're.Pattern' objects}
500000 0.166 0.000 0.166 0.000 {method 'group' of 're.Match' objects}
We’re down to 1.66 seconds because we’ve cut the number of calls to group() by a factor of 5. I’m not sure if there’s any way to make this approach more efficient.
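One further variation I haven’t timed is to grab all five groups with groups() and convert them in a single map() call. The helper name below is hypothetical, and whether this actually beats group(1, 2, 3, 4, 5) is something to profile rather than assume:
from datetime import datetime

def get_ts_regex_groups(line, ts_pattern):
    """Parse string using a regex, converting every group in one pass."""
    year, month, day, hour, minute = map(int, ts_pattern.match(line).groups())
    return datetime(year=year, month=month, day=day, hour=hour, minute=minute)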
One of the satisfying things about these kinds of deep dives is that you end up researching various libraries and finding things you never knew existed. I’ve used the isoformat() method many times to print datetime objects in a readable format, but I had never heard of the fromisoformat() function for parsing strings. I thought I’d give it a try and see if it processes these strings correctly:
...
# Using regex:
# ts_pattern = re.compile(r'([\d]{4})-([\d]{2})-([\d]{2}) ([\d]{2}):([\d]{2})')
# timestamps = [get_ts_regex(line, ts_pattern) for line in lines]
# Using datetime.fromisoformat():
timestamps = [datetime.fromisoformat(line) for line in lines]
...
This works, and it’s impressively fast:
1003491 function calls (1003479 primitive calls) in 0.332 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
3/1 0.000 0.000 0.332 0.332 {built-in method builtins.exec}
1 0.010 0.010 0.332 0.332 process_data.py:1(<module>)
1 0.074 0.074 0.135 0.135 process_data.py:49(<listcomp>)
1 0.062 0.062 0.124 0.124 process_data.py:34(<listcomp>)
500132 0.061 0.000 0.061 0.000 {method 'rstrip' of 'str' objects}
1 0.059 0.059 0.061 0.061 {method 'readlines' of '_io._IOBase' objects}
500000 0.061 0.000 0.061 0.000 {built-in method fromisoformat}
This finished in 0.33 seconds, even with the overhead of profiling! We’re spending as much time in rstrip() as we are in fromisoformat(). I’m really looking forward to using this in my current project; a one-line change should shave 2 seconds or more off that project’s runtime.
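One caveat before dropping fromisoformat() in everywhere: as I understand it, on Python versions before 3.11 it only accepts the fairly narrow set of strings that isoformat() itself produces, so it’s worth checking your own format against your interpreter first. A quick sketch:
from datetime import datetime

# The format used in this article parses fine on Python 3.7+:
print(datetime.fromisoformat("2016-02-09 15:45"))
# 2016-02-09 15:45:00

# But strings outside the narrow isoformat() style, such as a trailing "Z"
# offset, raise ValueError on versions before 3.11:
# datetime.fromisoformat("2016-02-09T15:45:00Z")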
I doubt we’re going to get much faster than this, but let’s take a look at a few more approaches before drawing any conclusions.
I know that NumPy has a lot of functionality focused on efficient data processing. Let’s take a look at the numpy.datetime64() function:
...
import numpy as np
...
# Using datetime.fromisoformat():
# timestamps = [datetime.fromisoformat(line) for line in lines]
# Using numpy.datetime64():
timestamps = [np.datetime64(line) for line in lines]
print("\nVerify conversion:")
for line, ts in zip(lines[:3], timestamps[:3]):
try:
print(f" {line} -> {ts.isoformat()}")
except AttributeError:
print(f" {line} -> {np.datetime_as_string(ts)}")
The numpy.datetime64() function returns NumPy’s own version of datetime objects, so we need to modify the code that prints a few datetimes at the end of the file. We can use the datetime_as_string() function to display these objects.
568106 function calls (565947 primitive calls) in 0.435 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
433/1 0.001 0.000 0.435 0.435 {built-in method builtins.exec}
1 0.010 0.010 0.435 0.435 process_data.py:1(<module>)
12 0.001 0.000 0.223 0.019 __init__.py:1(<module>)
1 0.133 0.133 0.133 0.133 process_data.py:54(<listcomp>)
1 0.061 0.061 0.122 0.122 process_data.py:36(<listcomp>)
This is almost as fast as the datetime.fromisoformat() function. It should be noted, however, that this is not a drop-in replacement for strptime(), because it doesn’t return standard datetime objects. We’d either need to convert these NumPy objects to standard Python datetime objects, or revise our code to work with NumPy’s datetime64 objects.
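If you do want standard datetime objects back, the conversion itself is short, though it adds per-element work that would eat into the speedup. This is a minimal sketch of the idea, not something I’ve profiled here:
import numpy as np

ts = np.datetime64("2016-02-09 15:45")

# .item() on a datetime64 with time-level resolution returns a plain
# datetime.datetime object:
dt = ts.item()
print(type(dt), dt)
# <class 'datetime.datetime'> 2016-02-09 15:45:00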
Let’s see how Pandas compares to NumPy and the other approaches. I believe Pandas uses NumPy under the hood, so I’m expecting something comparable to NumPy’s performance.
A naive approach with Pandas gives terrible performance:
...
import pandas as pd
...
# Using numpy.datetime64():
# timestamps = [np.datetime64(line) for line in lines]
# Using pandas.to_datetime():
timestamps = [pd.to_datetime(line) for line in lines]
...
Pandas has a to_datetime() function that will convert a string to a datetime object. But if you use it to act on a large number of individual items, it’s really slow:
135224560 function calls (135217902 primitive calls) in 66.792 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
695/1 0.004 0.000 66.793 66.793 {built-in method builtins.exec}
1 0.011 0.011 66.793 66.793 process_data.py:1(<module>)
1 0.389 0.389 65.420 65.420 process_data.py:58(<listcomp>)
500000 2.557 0.000 65.031 0.000 datetimes.py:604(to_datetime)
This takes over a minute. But this function is much more efficient when it acts directly on a collection of strings:
# Using numpy.datetime64():
# timestamps = [np.datetime64(line) for line in lines]
# Using pandas.to_datetime():
timestamps = pd.to_datetime(lines)
...
Now we get the efficiency we were expecting:
726135 function calls (719233 primitive calls) in 0.645 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
695/1 0.003 0.000 0.646 0.646 {built-in method builtins.exec}
1 0.011 0.011 0.646 0.646 process_data.py:1(<module>)
This approach processes all the strings in 0.65 seconds, which is a little slower than using NumPy in this example.
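One pandas-specific tweak I haven’t benchmarked here: to_datetime() accepts a format argument, and passing the known format explicitly can spare pandas the work of inferring it. Worth profiling against the plain call before relying on it:
import pandas as pd

# lines is the list of stripped timestamp strings read earlier.
timestamps = pd.to_datetime(lines, format="%Y-%m-%d %H:%M")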
There are a couple more time-focused libraries I was curious to try. Here’s dateutil:
...
from dateutil.parser import parse
...
# Using pandas.to_datetime():
# timestamps = pd.to_datetime(lines)
# Using dateutil:
timestamps = [parse(line) for line in lines]
...
Here’s the results:
121724585 function calls (121717939 primitive calls) in 52.747 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
695/1 0.003 0.000 52.748 52.748 {built-in method builtins.exec}
1 0.010 0.010 52.748 52.748 process_data.py:1(<module>)
1 0.235 0.235 52.226 52.226 process_data.py:62(<listcomp>)
500000 0.423 0.000 51.991 0.000 _parser.py:1276(parse)
500000 1.556 0.000 51.568 0.000 _parser.py:578(parse)
500000 4.091 0.000 42.447 0.000 _parser.py:672(_parse)
500000 1.567 0.000 20.373 0.000 _parser.py:205(split)
5000000 1.682 0.000 17.714 0.000 _parser.py:195(__next__)
5000000 9.392 0.000 16.032 0.000 _parser.py:83(get_token)
This is really slow, at 52.75 seconds, and I’m not aware of any way to speed this up with dateutil. You can see all the work that parse is doing here.
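To be fair to dateutil, all of that work buys flexibility none of the other approaches have: parse() figures out the format on its own, which matters when your timestamps aren’t all in one known layout. A quick illustration, separate from the timing comparison:
from dateutil.parser import parse

# Both of these produce datetime(2016, 2, 9, 15, 45) without a format string:
print(parse("2016-02-09 15:45"))
print(parse("Feb 9, 2016 3:45 PM"))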
Arrow is another of the more recent time-focused Python libraries. We’ll use Arrow’s get() function:
...
import arrow
...
# Using dateutil:
# timestamps = [parse(line) for line in lines]
# Using arrow.get()
timestamps = [arrow.get(line) for line in lines]
...
Here’s the results:
106243632 function calls (106236014 primitive calls) in 58.767 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
706/1 0.003 0.000 58.768 58.768 {built-in method builtins.exec}
1 0.011 0.011 58.768 58.768 process_data.py:1(<module>)
1 0.298 0.298 58.206 58.206 process_data.py:66(<listcomp>)
500000 0.382 0.000 57.908 0.000 api.py:16(get)
500000 2.605 0.000 57.526 0.000 factory.py:34(get)
500000 2.601 0.000 37.515 0.000 parser.py:117(parse_iso)
500000 0.431 0.000 27.862 0.000 parser.py:518(_parse_multiformat)
500000 2.140 0.000 27.431 0.000 parser.py:216(parse)
500000 8.967 0.000 18.436 0.000 parser.py:242(_generate_pattern_re)
500000 2.945 0.000 14.164 0.000 parser.py:82(__init__)
3000000 2.228 0.000 8.960 0.000 parser.py:539(_generate_choice_re)
6500569 4.647 0.000 7.517 0.000 re.py:287(_compile)
4500060 1.248 0.000 6.398 0.000 re.py:248(compile)
At 58.77 seconds this is really slow, and I’m not aware of any way to speed this up. It looks like Arrow is using re internally.
To make it easier to run comparisons, let’s add a CLI argument to choose which approach to try. We’ll also use perf_counter() to see how much time is spent in the conversion process:
"""Read timestamp strings from a file, and convert them to datetime objects.
Timestamp format: "2016-02-09 15:45"
"""
import re, sys
from datetime import datetime
from dateutil.parser import parse
from time import perf_counter
import numpy as np
import pandas as pd
import arrow
try:
    parse_method = sys.argv[1]
except IndexError:
    parse_method = 'strptime'

parse_methods = ['strptime', 'string-parsing', 'regex', 'fromisoformat',
                 'numpy', 'pandas', 'dateutil', 'arrow']
if parse_method not in parse_methods:
    print("The parsing method must be one of the following:")
    print(parse_methods)
    sys.exit()
def get_ts_string_parser(line):
    ...

def get_ts_regex(line, ts_pattern):
    ...
print("Reading data from file...")
with open('data_file.txt') as f:
    lines = f.readlines()
lines = [line.rstrip() for line in lines]
print(f" Found {len(lines)} timestamp strings.")
print("\nProcessing timestamps...")
start = perf_counter()
if parse_method == 'strptime':
    timestamps = [datetime.strptime(line, "%Y-%m-%d %H:%M") for line in lines]
elif parse_method == 'string-parsing':
    timestamps = [get_ts_string_parser(line) for line in lines]
elif parse_method == 'regex':
    ts_pattern = re.compile(
        r'([\d]{4})-([\d]{2})-([\d]{2}) ([\d]{2}):([\d]{2})')
    timestamps = [get_ts_regex(line, ts_pattern) for line in lines]
elif parse_method == 'fromisoformat':
    timestamps = [datetime.fromisoformat(line) for line in lines]
elif parse_method == 'numpy':
    timestamps = [np.datetime64(line) for line in lines]
elif parse_method == 'pandas':
    timestamps = pd.to_datetime(lines)
elif parse_method == 'dateutil':
    timestamps = [parse(line) for line in lines]
elif parse_method == 'arrow':
    timestamps = [arrow.get(line) for line in lines]
end = perf_counter()
processing_time = round(end - start, 2)
print(f" Processed {len(timestamps)} in {processing_time} seconds.")
print("\nVerify conversion:")
for line, ts in zip(lines[:3], timestamps[:3]):
    try:
        print(f" {line} -> {ts.isoformat()}")
    except AttributeError:
        print(f" {line} -> {np.datetime_as_string(ts)}")
Now we can do comparisons much more easily:
$ python generate_data.py 500_000
Building strings...
Writing strings to file...
Wrote 500000 timestamp strings to file.
$ python process_data.py strptime
Reading data from file...
Found 500000 timestamp strings.
Processing timestamps...
Processed 500000 in 3.69 seconds.
Verify conversion:
1924-10-21 14:58 -> 1924-10-21T14:58:00
1901-10-12 17:34 -> 1901-10-12T17:34:00
1905-10-27 14:13 -> 1905-10-27T14:13:00
$ python process_data.py string-parsing
Reading data from file...
Found 500000 timestamp strings.
Processing timestamps...
Processed 500000 in 0.97 seconds.
Verify conversion:
1924-10-21 14:58 -> 1924-10-21T14:58:00
1901-10-12 17:34 -> 1901-10-12T17:34:00
1905-10-27 14:13 -> 1905-10-27T14:13:00
$ python process_data.py fromisoformat
Reading data from file...
Found 500000 timestamp strings.
Processing timestamps...
Processed 500000 in 0.08 seconds.
Verify conversion:
1924-10-21 14:58 -> 1924-10-21T14:58:00
1901-10-12 17:34 -> 1901-10-12T17:34:00
1905-10-27 14:13 -> 1905-10-27T14:13:00
Here’s a summary of how each approach performs, using perf_counter() as shown above:
| Library / approach used | time (500k timestamp strings) | speedup over strptime |
|---|---|---|
| datetime.fromisoformat() | 0.08 s | 46.0x |
| pandas.to_datetime() | 0.10 s | 36.9x |
| numpy.datetime64() | 0.14 s | 26.4x |
| string parsing | 0.97 s | 3.80x |
| regex | 1.20 s | 3.08x |
| strptime() | 3.69 s | — |
| dateutil | 25.92 s | 0.14x |
| arrow | 30.71 s | 0.12x |
Keep in mind these are one-off runs; they are not averaged over multiple runs. However, I don’t think repeated runs would change the relative performance of any of these approaches.
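If you wanted tighter numbers, the standard library’s timeit module makes it easy to take the best of several runs. Here’s a rough sketch for the fromisoformat() case; the repeat counts are arbitrary:
from datetime import datetime
from timeit import repeat

with open('data_file.txt') as f:
    lines = [line.rstrip() for line in f]

# Each run converts the full list of timestamp strings; keep the best of 5.
times = repeat(
    lambda: [datetime.fromisoformat(line) for line in lines],
    number=1, repeat=5,
)
print(f"best of 5 runs: {min(times):.3f} seconds")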
The original question was, “What’s faster than strptime()?” The short answer is that there are a number of faster options. Which one will work best for your project depends on the specific conversion you’re making, and how you’re making it.
For example, using pandas.to_datetime() would work well if you can run the conversion on a whole collection of timestamp strings at once. But if the conversion takes place within a loop and you need to act on individual strings, this function would not be a good choice.
In my own current project, I’m probably going to just drop in the datetime.fromisoformat() function in place of datetime.strptime(). With that one-line change, I should get almost a 50x speedup in the time spent creating datetime objects. With the original profile time of 2.5 seconds, I expect that to drop to about 0.05 seconds. That effectively shaves 2.5 seconds off my overall runtime, and makes the timestamp conversion step almost negligible.
Also, I learned a much bigger principle here. I used to assume library code was already optimized for efficiency. I know that Python as a language prioritizes developer time over machine efficiency in many cases, but I forget that’s true at the function level as well. I’ll continue to appreciate the flexibility of Python’s standard approaches when doing exploratory work and when working with small datasets. But I understand how to work with profiling results a little better now, especially when bottlenecks come from code I haven’t written. I used to think most of my optimization and refactoring work needed to focus on code I’d written myself. From now on I’ll pay more careful attention to the standard library and third-party code I’m using when it affects performance significantly.
Also note that the conclusion is not “Python is slow.” That’s an old criticism that still surfaces from time to time. Although it’s true in some cases, many times it just means the person writing the code doesn’t understand how best to use Python. The answer here wasn’t “don’t use Python”, it was “find out what Python code will do the work in this bottleneck most efficiently.” That’s a good takeaway for every Python programmer who hasn’t already learned this lesson.
You can find the final version of the programs for this investigation in the faster_than_strptime GitHub repo.
I don’t have comments enabled at this time, but if you have any feedback feel free to reach out. I will correct any mistakes or misunderstandings on my part that are pointed out.