-
Notifications
You must be signed in to change notification settings - Fork 4.4k
[BEAM-2150] Relax regex to support wildcard globbing for GCS #2866
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ed460a5
to
921ffa5
Compare
I played around with this a bit by rebasing on top of master, and found it is a nice improvement but doesn't quite work yet. I created the followng files:
Then I ran: Using
However, with the same glob, TextIO only reads |
986385c
to
f013ea4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking pretty great!
break; | ||
case '?': | ||
dst.append("[^/]"); | ||
dst.append("."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is quite the right interpretation here. If I ask for f???.txt
I probably don't intend for f/g/.txt
to be captured in the regex.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm with you there, I'll revert that one.
// Not a glob. | ||
if (isWildcard(gcsPattern)) { | ||
// Part before the first wildcard character. | ||
prefix = getGlobPrefix(gcsPattern.getObject()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe also rename getGlobPrefix
to something like "getNonWildcardPrefix`?
f013ea4
to
8873155
Compare
Fixed! |
Does not compile: 2017-05-10T16:41:42.054 [ERROR] /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_MavenInstall/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java:[181,28] cannot find symbol |
7fae678
to
dc0df47
Compare
assertEquals("foo", GcsUtil.wildcardToRegexp("foo")); | ||
assertEquals("fo.*o", GcsUtil.wildcardToRegexp("fo*o")); | ||
assertEquals("f.*o\\.[^/]", GcsUtil.wildcardToRegexp("f*o.?")); | ||
assertEquals("foo-[0-9].*", GcsUtil.wildcardToRegexp("foo-[0-9]*")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd expect to see new test cases here for **
, **/*
, etc.
dc0df47
to
3744be6
Compare
Thanks! Merging despite the flaky spark test. |
Be sure to do all of the following to help us incorporate your contribution
quickly and easily:
[BEAM-<Jira issue #>] Description of pull request
mvn clean verify
.<Jira issue #>
in the title with the actual Jira issuenumber, if there is one.
Individual Contributor License Agreement.
Something I've noticed is that Beam's usage of the GCS API doesn't leverage delimiters so we're actually always iterating over the full set of objects after the prefix which is why this PR is so tiny.
Ideally, we can actually specify the delimiter
/
when not using recursive wildcards (**
) for some efficiency gains.