Commit 2f09413
closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558)
The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalize(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.
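As a small illustration of that equivalence (not code from this commit), the sketch below compares `unicodedata.is_normalized` against the straightforward normalize-and-compare spelling. The helper name `is_normalized_slow` and the sample strings are made up for this example; it assumes Python 3.8 or later, where `is_normalized` exists.

```python
import unicodedata


def is_normalized_slow(form, s):
    """Reference spelling: normalize the whole string, then compare."""
    return s == unicodedata.normalize(form, s)


# A few arbitrary samples: ASCII, precomposed and decomposed "e with acute",
# a CJK compatibility ideograph, and a lone combining mark.
samples = ["abc", "\u00e9", "e\u0301", "\uf900", "\u0338"]

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    for s in samples:
        # Both spellings should always give the same answer;
        # is_normalized is only meant to get there faster.
        assert unicodedata.is_normalized(form, s) == is_normalized_slow(form, s)
```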
However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.
Implement the standard's algorithm. This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.
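For readers who want to see the shape of the quick-check loop, here is a rough Python sketch for the NFD form only. It is an illustration of the algorithm as described in UAX #15, not the C code changed by this commit. `unicodedata` does not expose the quick-check properties, so `nfd_qc` is a hypothetical stand-in that derives NFD_Quick_Check from whether a character has a canonical decomposition; the names `quick_check`, `nfd_qc`, `YES`, `MAYBE` and `NO` are likewise made up here.

```python
import unicodedata

YES, MAYBE, NO = "YES", "MAYBE", "NO"


def nfd_qc(ch):
    """Hypothetical stand-in for the NFD_Quick_Check property of one character."""
    # Precomposed Hangul syllables decompose algorithmically, so they do not
    # show up in unicodedata.decomposition() and need a range check.
    if 0xAC00 <= ord(ch) <= 0xD7A3:
        return NO
    mapping = unicodedata.decomposition(ch)
    # A mapping without a "<tag>" prefix is a canonical decomposition.
    if mapping and not mapping.startswith("<"):
        return NO
    return YES


def quick_check(s, qc=nfd_qc):
    """Return YES, NO or MAYBE for the string s, scanning it once."""
    last_ccc = 0
    result = YES
    for ch in s:
        ccc = unicodedata.combining(ch)
        # Combining marks out of canonical order: definitely not normalized.
        if ccc != 0 and last_ccc > ccc:
            return NO
        check = qc(ch)
        if check == NO:
            return NO  # stop at the first disallowed character
        if check == MAYBE:
            result = MAYBE  # keep scanning; a later character may still give NO
        last_ccc = ccc
    return result
```

The early `return NO` is what lets a caller answer without normalizing anything; only a final MAYBE requires falling back to the slow normalize-and-compare. For NFD the stand-in never produces MAYBE, so the `MAYBE` branch is only there to show the general structure.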
In a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:
$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
-- 'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop
With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:
$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
-- 'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop
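The same measurement can be approximated from inside Python. The sketch below is only a convenience wrapper around the standard `timeit` module, with a made-up helper name `best_seconds`; it also spells out the arithmetic behind the figures quoted above. The 1 MB size relies on a CPython implementation detail: a string of BMP code points above U+00FF is stored at 2 bytes per code point.

```python
import timeit
import unicodedata

s = "\uf900" * 500000

# CPython detail: 2 bytes per code point for this string, so about 1 MB.
size_bytes = 2 * len(s)


def best_seconds(stmt, number=5, repeat=3):
    """Best per-call time in seconds, similar to what `python -m timeit` reports."""
    runs = timeit.repeat(
        stmt,
        globals={"unicodedata": unicodedata, "s": s},
        number=number,
        repeat=repeat,
    )
    return min(runs) / number


per_call = best_seconds('unicodedata.is_normalized("NFD", s)')
print(f"{per_call * 1e9 / size_bytes:.4f} ns per byte on this interpreter")

# Arithmetic on the numbers reported in the commit message:
# 4.39 msec for 1,000,000 bytes, versus 58.2 nsec for the whole call.
old_ns_per_byte = 4.39e-3 / 1_000_000 * 1e9
speedup = 4.39e-3 / 58.2e-9
print(f"old: {old_ns_per_byte:.2f} ns per byte, ratio: {speedup:.0f}x")
```

Absolute timings depend on the machine and the interpreter build, so only the ratio between the two builds is meaningful.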
This restores a small optimization that the original version of this
code had for the `unicodedata.normalize` use case.
With this, that case is actually faster than in master!
$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
-- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop
$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
-- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
1 parent 580bdb0 commit 2f09413
File tree
4 files changed, +59 -26 lines changed
- Doc/whatsnew
- Lib/test
- Misc/NEWS.d/next/Core and Builtins
- Modules