[BEAM-2150] Relax regex to support wildcard globbing for GCS #2866

Closed

@meunierd meunierd commented May 3, 2017

Something I've noticed is that Beam's usage of the GCS API doesn't leverage delimiters, so we always iterate over the full set of objects after the prefix. That is why this PR is so tiny.

Ideally, we could specify the delimiter / when not using recursive wildcards (**), for some efficiency gains.
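
To make the delimiter point concrete, here is a minimal in-memory sketch of how a prefix-only listing differs from a prefix-plus-delimiter listing. It is a simulation, not the GCS client API: the class and method names are hypothetical, and a real delimiter listing reports collapsed prefixes separately from objects, whereas this sketch merges them into one list for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical in-memory simulation of prefix/delimiter listing; not the GCS API. */
class DelimiterListingSketch {

  /** Prefix only: every object under the prefix is returned, at any depth. */
  static List<String> listAll(List<String> objects, String prefix) {
    List<String> out = new ArrayList<>();
    for (String name : objects) {
      if (name.startsWith(prefix)) {
        out.add(name);
      }
    }
    return out;
  }

  /**
   * Prefix plus delimiter: objects directly under the prefix are returned as-is,
   * and everything deeper is collapsed into one entry per "subdirectory".
   */
  static List<String> listOneLevel(List<String> objects, String prefix, char delimiter) {
    TreeSet<String> out = new TreeSet<>();
    for (String name : objects) {
      if (!name.startsWith(prefix)) {
        continue;
      }
      int next = name.indexOf(delimiter, prefix.length());
      out.add(next < 0 ? name : name.substring(0, next + 1));
    }
    return new ArrayList<>(out);
  }

  public static void main(String[] args) {
    List<String> objects = Arrays.asList(
        "logs/a.txt", "logs/b.txt", "logs/2017/c.txt", "logs/2017/05/d.txt");
    System.out.println(listAll(objects, "logs/"));           // four entries
    System.out.println(listOneLevel(objects, "logs/", '/')); // three entries
  }
}
```

With a non-recursive glob such as logs/*.txt, only the one-level listing is needed, so the deeper objects never have to be enumerated.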

dhalperi commented May 3, 2017

R: @dhalperi, some discussion ongoing in BEAM-2150

@meunierd meunierd force-pushed the BEAM-2150-gcs-recursive-wildcards branch 2 times, most recently from ed460a5 to 921ffa5 Compare May 4, 2017 01:20

dhalperi commented May 5, 2017

I played around with this a bit by rebasing on top of master, and found it is a nice improvement but doesn't quite work yet.

I created the following files:

gcs-recursive/file1.txt   # contains "cat"
gcs-recursive/somedir/file2.txt  # contains "dog"

Then I ran: gsutil cp -r ~/gcs-recursive gs://BUCKET/

Using gsutil: gsutil ls 'gs://BUCKET/gcs-recursive/**/*.txt'

gs://BUCKET/gcs-recursive/file1.txt
gs://BUCKET/gcs-recursive/somedir/file2.txt

However, with the same glob, TextIO only reads file2.txt.
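
One possible explanation, offered only as an assumption and not confirmed in this thread: if the glob is translated to a regex in which `**` becomes `.*` and the following `/` stays literal, then the pattern requires at least one directory level after the prefix. The sketch below uses hand-written regexes (hypothetical, not Beam's actual output) to show that effect, alongside a variant where `**/` may match zero directories.

```java
import java.util.regex.Pattern;

/** Hypothetical regexes illustrating why a top-level file could be skipped. */
class RecursiveGlobMismatchSketch {
  // Assumed naive translation of gcs-recursive/**/*.txt:
  // "**" -> ".*", "/" kept literal, "*" -> "[^/]*".
  static final Pattern NAIVE =
      Pattern.compile("gcs-recursive/.*/[^/]*\\.txt");

  // Variant in which "**/" is optional, so zero directories also match.
  static final Pattern ZERO_OR_MORE_DIRS =
      Pattern.compile("gcs-recursive/(.*/)?[^/]*\\.txt");

  static boolean matches(Pattern p, String objectName) {
    return p.matcher(objectName).matches();
  }

  public static void main(String[] args) {
    String top = "gcs-recursive/file1.txt";
    String nested = "gcs-recursive/somedir/file2.txt";
    System.out.println(matches(NAIVE, top));                // false
    System.out.println(matches(NAIVE, nested));             // true
    System.out.println(matches(ZERO_OR_MORE_DIRS, top));    // true
    System.out.println(matches(ZERO_OR_MORE_DIRS, nested)); // true
  }
}
```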

@meunierd meunierd force-pushed the BEAM-2150-gcs-recursive-wildcards branch 2 times, most recently from 986385c to f013ea4 Compare May 10, 2017 12:59

@dhalperi dhalperi left a comment

Looking pretty great!

      break;
    case '?':
-     dst.append("[^/]");
+     dst.append(".");

I'm not sure this is quite the right interpretation here. If I ask for f???.txt I probably don't intend for f/g/.txt to be captured in the regex.

Thoughts?

Contributor Author

I'm with you there, I'll revert that one.
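
The concern about `?` can be checked directly with java.util.regex. This is a small sketch with hand-written regexes for the glob f???.txt (the class name is hypothetical): one maps each `?` to `.`, the other to `[^/]`.

```java
import java.util.regex.Pattern;

/** Compares two hand-written regex translations of the glob f???.txt. */
class QuestionMarkSketch {
  // Each '?' mapped to '.', which also matches '/'.
  static final String ANY_CHAR = "f...\\.txt";
  // Each '?' mapped to "[^/]", which stays within one path segment.
  static final String NON_SLASH = "f[^/][^/][^/]\\.txt";

  static boolean matches(String regex, String objectName) {
    return Pattern.matches(regex, objectName);
  }

  public static void main(String[] args) {
    System.out.println(matches(ANY_CHAR, "f/g/.txt"));  // true: crosses directories
    System.out.println(matches(NON_SLASH, "f/g/.txt")); // false
    System.out.println(matches(NON_SLASH, "foo1.txt")); // true
  }
}
```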

// Not a glob.
if (isWildcard(gcsPattern)) {
// Part before the first wildcard character.
prefix = getGlobPrefix(gcsPattern.getObject());

Maybe also rename getGlobPrefix to something like "getNonWildcardPrefix"?

@meunierd meunierd force-pushed the BEAM-2150-gcs-recursive-wildcards branch from f013ea4 to 8873155 Compare May 10, 2017 16:21
@meunierd

Fixed!

@dhalperi

Does not compile:

2017-05-10T16:41:42.054 [ERROR] /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_MavenInstall/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java:[181,28] cannot find symbol
2017-05-10T16:41:42.054 [ERROR] symbol: method getGlobPrefix(java.lang.String)
2017-05-10T16:41:42.054 [ERROR] location: class org.apache.beam.sdk.util.GcsUtil
2017-05-10T16:41:42.054 [ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.1:compile (default-compile) on project beam-sdks-java-extensions-google-cloud-platform-core: Compilation failure
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_MavenInstall/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java:[181,28] cannot find symbol
symbol: method getGlobPrefix(java.lang.String)
location: class org.apache.beam.sdk.util.GcsUtil

@meunierd meunierd force-pushed the BEAM-2150-gcs-recursive-wildcards branch 4 times, most recently from 7fae678 to dc0df47 Compare May 10, 2017 19:08
@coveralls

Coverage Status

Changes Unknown when pulling dc0df47 on meunierd:BEAM-2150-gcs-recursive-wildcards into apache:master.

assertEquals("foo", GcsUtil.wildcardToRegexp("foo"));
assertEquals("fo.*o", GcsUtil.wildcardToRegexp("fo*o"));
assertEquals("f.*o\\.[^/]", GcsUtil.wildcardToRegexp("f*o.?"));
assertEquals("foo-[0-9].*", GcsUtil.wildcardToRegexp("foo-[0-9]*"));

I'd expect to see new test cases here for **, **/*, etc.
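
As an illustration of what such cases might exercise, here is a self-contained sketch of a glob-to-regex translator with `**` support. It is not Beam's GcsUtil.wildcardToRegexp: the class and method names are hypothetical, and it assumes gsutil-like semantics in which a single `*` and `?` stay within one path segment while `**` crosses `/`. That is stricter than the `.*` mapping for a single `*` seen in the assertions quoted above.

```java
import java.util.regex.Pattern;

/** Hypothetical glob translator; a sketch, not Beam's implementation. */
class GlobToRegexSketch {

  static String wildcardToRegex(String glob) {
    StringBuilder dst = new StringBuilder();
    int i = 0;
    while (i < glob.length()) {
      char c = glob.charAt(i++);
      if (c == '*') {
        if (i < glob.length() && glob.charAt(i) == '*') {
          dst.append(".*");     // "**": any characters, including '/'
          i++;                  // consume the second '*'
        } else {
          dst.append("[^/]*");  // "*": any characters except '/'
        }
      } else if (c == '?') {
        dst.append("[^/]");     // "?": one character except '/'
      } else if ("\\.+(){}|^$".indexOf(c) >= 0) {
        dst.append('\\').append(c); // escape regex metacharacters
      } else {
        // Everything else is copied, so simple classes like [0-9] pass through.
        dst.append(c);
      }
    }
    return dst.toString();
  }

  public static void main(String[] args) {
    System.out.println(wildcardToRegex("fo*o"));        // fo[^/]*o
    System.out.println(wildcardToRegex("foo-[0-9]**")); // foo-[0-9].*
    System.out.println(wildcardToRegex("**/*.txt"));    // .*/[^/]*\.txt

    String regex = wildcardToRegex("gcs-recursive/**.txt");
    System.out.println(Pattern.matches(regex, "gcs-recursive/file1.txt"));
    System.out.println(Pattern.matches(regex, "gcs-recursive/somedir/file2.txt"));
  }
}
```

Test cases along these lines would pin down the difference between `*` and `**` and catch regressions like a `?` that matches `/`.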

@meunierd meunierd force-pushed the BEAM-2150-gcs-recursive-wildcards branch from dc0df47 to 3744be6 Compare May 11, 2017 00:26
@coveralls

Coverage Status

Coverage decreased (-0.03%) to 70.168% when pulling 3744be6 on meunierd:BEAM-2150-gcs-recursive-wildcards into a39960b on apache:master.

@dhalperi

Thanks! Merging despite the flaky Spark test.

@asfgit asfgit closed this in 15bd3a3 May 11, 2017